Abstract

A major impediment in the development of efficient full genome sequencing is the large 
portion of erroneous reads produced by sequencing platforms. Error correction is the com- 
putational process that attempts to identify and correct these mistakes. Several classical 
stringology problems, including the Consensus String problem, are used to model error cor- 
rection. However, a significant shortcoming of using these formulations is that they do not 
account for a few of the reads being too erroneous to correct; these outlier strings potentially 
have great effect on the solution, and should be detected and removed. We formalize the 
problem of error correction with outlier detection by defining the Consensus String with 
Outliers problem. Given $n$ length-$\ell$ strings $S = \{s_1, \ldots, s_n\}$ over a constant-size alphabet
$\Sigma$ together with parameters $d$ and $k$, the objective in the Consensus String with Outliers
problem is to find a subset $S^*$ of $S$ of size $n - k$ and a string $s$ such that $\sum_{s_i \in S^*} d(s_i, s) \le d$.
Here d(x, y) denotes the Hamming distance between the two strings x and y. We prove the 
following results: 

• A variant of Consensus String with Outliers where the number of outliers k is fixed and 
the objective is to minimize the total distance $\sum_{s_i \in S^*} d(s_i, s)$ admits a simple PTAS.
Our PTAS can easily be modified to also handle the variant of the problem where a 
hard upper bound d on the total distance is given as input, and the size of S* is to be 
maximized. The approximation schemes are simple enough that our results are best 
viewed as a performance guarantee on natural heuristics for the problem when the 
parameters of the heuristic are chosen appropriately. 

• Under the natural assumption that the number of outliers k is small, the PTAS for 
the distance minimization version of Consensus String with Outliers performs well. In 
particular, as long as k < cn for a fixed constant c < 1, the algorithm provides a 
$(1 + \epsilon)$-approximate solution in time $f(1/\epsilon)(n\ell)^{O(1)}$ and thus is an EPTAS.

• In order to improve the PTAS for Consensus String with Outliers to an EPTAS, the 
assumption that k is small is necessary. Specifically, when k is allowed to be arbi- 
trary the Consensus String with Outliers problem does not admit an EPTAS unless 
FPT=W[1]. This hardness result holds even for binary alphabets. 

• The decision version of Consensus String with Outliers is fixed parameter tractable 
when parameterized by $\frac{d}{n-k}$ and thus also when parameterized by just $d$.

To the best of our knowledge, Consensus String with Outliers is the first problem that admits 
a PTAS, and is fixed parameter tractable when parameterized by the value of the objective 
function but does not admit an EPTAS under plausible complexity assumptions. Hence, the 
proof of our hardness of approximation result combines parameterized reductions and gap 
preserving reductions in a novel manner. 
1 Introduction



Although the laboratory methods that generate genetic sequence data have advanced remarkably 
since their initial use in the Human Genome Project [37], the algorithms behind the computational
methods have not advanced as dramatically. Sajjadian et al. [35] describe the present
time as a "watershed moment in genomics," pointing to computational genomics as the bottleneck
of the sequencing process. In this paper, we revisit an essential problem arising in genome
sequencing, reformulate this problem to better model noisy data, and show how studying approx- 
imability and parameterized complexity of this problem leads to surprising theoretical insights 
and algorithmic techniques that may assist in genome sequencing. 

Since the discovery of DNA as the basic unit of heredity, significant effort has been focused 
on automated determination of the sequence of nucleotides corresponding to a sample of DNA, 
a process referred to as genome sequencing. The key technology this process relies on is the 
sequencing platform that accepts a collection of biological (DNA) samples and produces reads 
from the samples. A read is a string from the alphabet {A, C, G, T} that represents the sequence 
of nucleotides in a sample. Sequencing platforms are extremely limited in that they cannot 
process the entire DNA sample at once but rather, they handle very small pieces of the DNA at 
a time. The resulting problem for an average-size genome of length 4 million is that ~20 million 
reads of length 50 must be assembled into one contiguous piece. This computational process of 
building the contiguous string from reads is referred to as fragment assembly, and is especially 
challenging, if not impossible, for complex genomes with higher repeat and duplication content.
While the current generation of sequencing platforms can produce a large number of reads in
a relatively short period of time, the reads they produce are highly error-prone, increasing
the computational difficulty of fragment assembly. Error correction, which is vital in genome
assembly, aims to identify and correct any mistakes made by the sequencing platform and thus
reduces the computational demands of the fragment assembly algorithms [34].

Contamination of the DNA sample and erroneous runs of the sequencing platforms are fre- 
quent occurrences that lead to many reads having a large fraction of errors and hence, deviate 
quite dramatically from the rest of the data. Ideally, these "outlier" strings should be detected 
and removed from the input prior to assembly. Although problem formulations with outliers
have been previously proposed and studied in different contexts-including machine learning 
PIS EH E2], network design problems [H EJ EB QSl ESI , and bioinformatics [3 [26] -it has 
not been considered or proposed in error correction of genome sequencing data. Present error 
correction methods do not account for the possibility of outliers, and hence, are required to be 
highly liberal in the elimination of data. Therefore, they remove a large number of reads that 
could have been used for assembly. We introduce the following formulation of error correction 
that also captures the existence of outliers in the data. 

Consensus String with Outliers 

Input: a set$^1$ of $n$ length-$\ell$ strings $S = \{s_1, \ldots, s_n\}$ over a finite alphabet $\Sigma$ and nonnegative
integers $k$ and $d$.

Question: Find a length-$\ell$ string $s$ and a subset $S^*$ of $S$ of size $n - k$ such that $\sum_{s_i \in S^*} d(s_i, s) \le d$.

We restrict interest to Hamming distance and let $d(x, y)$ denote the Hamming distance between
the length-$\ell$ strings $x$ and $y$. The following are natural optimization versions of Consensus
String with Outliers that we will consider:

• Consensus String with Max Non-Outliers: given $n$ length-$\ell$ strings $S = \{s_1, \ldots, s_n\}$ over
a finite alphabet $\Sigma$ and a nonnegative integer $d$, the aim is to find a consensus string $s$ and
a subset $S^* \subseteq S$ such that $|S^*|$ is maximal and $\sum_{s_i \in S^*} d(s_i, s) \le d$.

1 Technically, this is a multi-set since we allow any string to occur multiple times. 
• Min-distance Consensus String with Outliers: given $n$ length-$\ell$ strings $S = \{s_1, \ldots, s_n\}$
over a finite alphabet $\Sigma$ and a nonnegative integer $k$, the aim is to find a consensus string
$s$ and a subset $S^* \subseteq S$ such that $n - |S^*| = k$ and $\sum_{s_i \in S^*} d(s_i, s)$ is minimal.
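
To make the objective concrete, the following minimal Python sketch checks whether a candidate pair (a string $s$ and a subset $S^*$) is a valid solution of the decision problem. The function names and the representation of $S^*$ as a list of indices into $S$ are illustrative assumptions, not notation from this paper.

```python
from typing import Sequence


def hamming(x: str, y: str) -> int:
    """Hamming distance d(x, y) between two equal-length strings."""
    if len(x) != len(y):
        raise ValueError("strings must have equal length")
    return sum(a != b for a, b in zip(x, y))


def total_distance(strings: Sequence[str], s: str) -> int:
    """Sum of Hamming distances from every string in `strings` to s."""
    return sum(hamming(t, s) for t in strings)


def is_feasible(S: Sequence[str], kept: Sequence[int], s: str, k: int, d: int) -> bool:
    """True if `kept` indexes n - k distinct strings of S whose total distance to s is at most d.

    `kept` is a hypothetical index-based encoding of the subset S*.
    """
    if len(set(kept)) != len(kept) or len(kept) != len(S) - k:
        return False
    return total_distance([S[i] for i in kept], s) <= d
```

For example, with `S = ["ACGT", "ACGA", "ACGT", "TTTT"]`, `k = 1` and `d = 1`, keeping indices `[0, 1, 2]` with `s = "ACGT"` is feasible, while keeping the outlier `"TTTT"` instead of a close string is not.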

Our Results. The problems considered are NP-hard in general; however, they turn out to be
amenable to approximation and parameterized algorithms. A polynomial-time approximation
scheme (PTAS) for a minimization problem is an algorithm that takes an instance of the problem
and a parameter $\epsilon > 0$ and, in polynomial time, produces a solution that is within a factor
$1 + \epsilon$ of being optimal. If the exponent of the polynomial in the running time of the algorithm
is independent of $\epsilon$, then the PTAS is said to be an efficient PTAS (EPTAS). We present several
results on the ability to efficiently solve and approximate the above optimization problems
within arbitrarily small factors, and demonstrate the tightness of these results. Specifically, we 
prove the following: 

• There exists a deterministic PTAS for Min-distance Consensus String with Outliers and 
Consensus String with Max Non-Outliers. 

• For instances where k < cn and fixed c < 1, the PTAS for Min-distance Consensus String 
with Outliers can be improved to a randomized EPTAS. 

• In the general case, both Min-distance Consensus String with Outliers and Consensus 
String with Max Non-Outliers do not admit an EPTAS, unless FPT=W[1]. Thus, the 
requirement that k < cn is necessary to improve the PTAS for Min-distance Consensus 
String with Outliers to an EPTAS. 

• Consensus String with Outliers can be solved in time $\delta^{O(\delta)}|\Sigma|^{\delta}n^9$, where $\delta = d/(n - k)$.

For a parameter $\delta$, an algorithm with running time $f(\delta)n^{O(1)}$ is called a fixed parameter tractable
(FPT) algorithm for the problem parameterized by $\delta$. Parameterized problems that admit such
algorithms are said to be FPT. Hence our algorithm for Consensus String with Outliers proves
that the problem is FPT parameterized by $\delta$.

Our approximation schemes are based on random sampling. If the number of outliers is 
small, then with reasonably high probability a small random subset of the input strings will 
not contain any outliers. If the random sample does not contain outliers then the sample can 
be used to estimate the optimal consensus string. We show that if the size of the sample and 
the number of repetitions of the experiment are chosen appropriately then there exists a good 
bound on the quality of the output of this natural heuristic. For inputs where the noise does not
completely overwhelm the data, i.e., when $k < cn$ for $c < 1$, the running time of our approximation
scheme for Min-distance Consensus String with Outliers depends favorably on $\epsilon$; more
specifically, the scheme is an EPTAS.

The difference in running time of a PTAS and an EPTAS can be quite dramatic. For 
instance, running an $O(2^{1/\epsilon} n)$-time algorithm is reasonable for $\epsilon = \frac{1}{10}$ and $n = 1000$, whereas
running an $O(n^{1/\epsilon})$-time algorithm is infeasible. Hence, considerable effort has been devoted
to improving PTASs to EPTASs, and to showing that such an improvement is unlikely for some
problems. For example, Arora [3] gave an $n^{O(1/\epsilon)}$-time PTAS for Euclidean TSP, which was
then improved to an $O(2^{O(1/\epsilon)} n^2)$-time algorithm in the journal version of the paper [6]. On the
other hand, Independent Set admits a PTAS on unit disk graphs [21], but Marx [30] showed that,
unless FPT=W[1], it does not admit an EPTAS. Many more examples of PTASs that have
been improved to EPTASs, and problems for which a PTAS exists but for which an EPTAS has
been ruled out under the assumption that FPT$\neq$W[1], can be found in the survey of Marx [31].
An interesting question is whether the requirement that k < cn for c < 1 is necessary in order 
to improve the PTAS for Min-distance Consensus String with Outliers to an EPTAS. Can an
EPTAS be obtained for this problem without the requirement? 
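
The gap between the two kinds of running time can be checked with a few lines of arithmetic. The sketch below takes $\epsilon = 1/10$ and $n = 1000$ as illustrative values and ignores the constants hidden in the $O$-notation, so the numbers are only indicative.

```python
# Illustrative comparison of an EPTAS-style bound 2^(1/eps) * n with a
# PTAS-style bound n^(1/eps); eps = 1/10 and n = 1000 are assumed values.
eps_inverse = 10   # 1 / eps
n = 1000

eptas_steps = 2 ** eps_inverse * n   # 2^10 * 1000, about a million steps
ptas_steps = n ** eps_inverse        # 1000^10 = 10^30 steps

print(f"2^(1/eps) * n = {eptas_steps:,}")
print(f"n^(1/eps)     = {ptas_steps:.3e}")
```

At a billion steps per second the first bound corresponds to about a millisecond of work, whereas the second corresponds to far more than the age of the universe.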

A useful observation in this regard is that an EPTAS for an optimization problem automatically
yields an FPT algorithm for the corresponding decision problem parameterized by the
value of the objective function [31]. More specifically, if we set $\epsilon$ slightly below $\frac{1}{\alpha}$, where $\alpha$ is the (integer) value of
the objective function, then a $(1 + \epsilon)$-approximation algorithm would distinguish between "yes"
and "no" instances of the problem. Hence, an EPTAS could be used to solve the problem in
$O(f(\epsilon)n^{O(1)}) = O(g(\alpha)n^{O(1)})$ time. This observation is frequently used to rule out the existence
of an EPTAS: if a problem does not admit an FPT algorithm parameterized by the value of the
objective function unless FPT=W[1], then the corresponding optimization problem does not
admit an EPTAS unless FPT=W[1].
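
The separation argument can be spelled out in one line. As a sketch, assume a minimization problem with integer objective values and take $\epsilon = \frac{1}{\alpha + 1}$, which is one admissible choice and is our assumption here:

$$(1 + \epsilon)\,\alpha = \alpha + \frac{\alpha}{\alpha + 1} < \alpha + 1.$$

If the optimum is at most $\alpha$, a $(1 + \epsilon)$-approximation returns a solution of value strictly below $\alpha + 1$, hence of value at most $\alpha$ by integrality. If the optimum exceeds $\alpha$, every solution has value at least $\alpha + 1$. Comparing the returned value with $\alpha$ therefore decides the instance.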

To the best of our knowledge all known results ruling out EPTASs for problems for which 
a PTAS is known use this approach. Unfortunately, it cannot be used to rule out an EPTAS 
for Min-distance Consensus String with Outliers because Consensus String with Outliers parameterized
by $d$ is FPT. In particular, we show there is an algorithm for Consensus String
with Outliers with running time $\delta^{O(\delta)}|\Sigma|^{\delta}n^9$, where $\delta = d/(n - k)$, and since $\delta$ is always at
most $d$ (and much smaller than $d$ for most inputs), this algorithm runs in $O(d^{O(d)}|\Sigma|^{d}n^9)$ time.
Our FPT algorithm is an adaptation of the algorithm by Marx [29] for the Consensus Patterns 
problem. 

In his survey, Marx [31] introduces a hybrid of FPT reductions and gap preserving reductions
and argues that it is conceivable that such reductions could be used to prove that a problem that 
has a PTAS and is FPT parameterized by the value of the objective function does not admit 
an EPTAS unless FPT=W[1]. We show that Min-distance Consensus String with Outliers does 
not admit an EPTAS unless FPT=W[1], giving the first example of this phenomenon. At the 
core of our reduction is an analysis of one-dimensional random walks where some of the steps 
are "double steps" that are taken in the same direction. The results on random walks could turn 
out to be useful in other hardness proofs and thus might be of independent interest. Parameterized
hardness results for a few other parameterizations of Consensus String with Outliers follow as 
simple corollaries of our construction. 

Related Work 

The problems considered in this paper belong to the more general class of stringology problems 
where a set of strings is given and the aim is to determine a single string that is representative 
for the set. The exact definition of what being a good representative means may vary and dif- 
ferent definitions lead to abstractions of various problems in bioinformatics [26]. The Consensus 
Patterns problem is quite similar to our problem; however, in this context the aim is to find
a substring in each of the input strings and a consensus string so that the sum of the Hamming
distances is minimized. Li et al. [27] gave a PTAS for this problem, and there has been significant
effort toward proving tighter bounds on the running time of the PTAS 18].
The Closest String problem is another related problem where the goal is to find a string that 
minimizes the maximum Hamming distance to any string. This problem also admits a PTAS 
but no EPTAS [28]. Both problems have been investigated in the framework of parameterized
complexity by several authors; however, the parameterization of Consensus Patterns with respect
to the distance appeared to be very challenging. In 2005, Marx [29] showed that Consensus
Patterns is FPT when parameterized by $\delta = d/n$ with bounded alphabet size.

Overview of DNA Fragment Assembly 

The general approach to large-scale sequencing is as follows: first the DNA is extracted from 
the cell and copied multiple times, then the DNA is cut into smaller fragments, each fragment 
is sequenced by a sequencing platform to produce a read, and finally the reads are assembled
into large segments of the genome. Figure 1 illustrates this process. One important point is
that sequencing platforms can produce many hundreds of thousands, or even millions of reads 
in a short time (on the order of a day) , but can only handle small segments of DNA at a time 
and produce relatively short reads. The copying step ensures that a position of the genome is 
sequenced multiple times and the reads overlap by an adequate amount. This overlap is what 
allows for the assembly of the reads into large contiguous strings. Developing novel algorithms 
and tools for the fragment assembly process is, at present, a very active area of research in 
bioinformatics. Current assembly tools are efficient; however, their accuracy is substantially
diminished by repeated regions in the genome sequence and sequencing errors. 

As previously mentioned, error correction of the reads is an important step in genome
sequencing; however, present algorithms are still unable to handle outliers in the data. The
majority of sequencing errors occur when the nucleotide found in a read deviates from the 
actual nucleotide in the DNA sample (i.e. a read has the symbol A at a position where it should 
be a C), making Hamming distance the most reasonable metric to use. In a read, for the first 50
positions the error rate is quite small, but for subsequent positions the error probability increases
exponentially [13, 36] (see Figure 1). This is why the length of the reads is at most 70-100. Due
to this change in the error probability and technical details related to fragment assembly,$^2$ error
correction begins by computing the set of all consecutive, length-$\ell$ substrings from each read,
where $\ell$ is an input parameter. Hence, error correction is implicitly performed on the set of all
length-$\ell$ contiguous substrings of reads rather than the full reads.
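
The preprocessing step described above can be sketched as follows. This is a minimal illustration; the function name `length_l_substrings` is a hypothetical choice, and the result is kept as a list so that repeated substrings are preserved as in a multiset.

```python
from typing import List, Sequence


def length_l_substrings(reads: Sequence[str], l: int) -> List[str]:
    """Return all contiguous length-l substrings of every read, with repetitions."""
    if l <= 0:
        raise ValueError("l must be positive")
    substrings = []
    for read in reads:
        # A read shorter than l contributes no substrings.
        for start in range(len(read) - l + 1):
            substrings.append(read[start:start + l])
    return substrings
```

For instance, `length_l_substrings(["ACGTA", "GG"], 3)` yields `["ACG", "CGT", "GTA"]`; the second read is too short to contribute.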

The majority of error correction algorithms consider the first 50 positions in a read and ignore
the remaining positions [25, 34, 38], eliminating a large portion of the data. This is unsatisfactory
since acquiring the data is both expensive and time consuming, and any loss of data will affect
the accuracy of the assembly. Due to the change in the error probability in the reads, some
of the length-$\ell$ strings will have a significant but tolerable number of errors (e.g., up to 15% of
positions being erroneous) that can be error corrected and thus used in fragment assembly.
On the other hand, the length-$\ell$ strings that stem from contaminated data or bad runs of the
sequencing platform should be detected and removed.

Preliminaries 

A maximization problem admits a PTAS if there is an algorithm $A(I, \epsilon)$ such that, for any $\epsilon > 0$
and any instance $I$ of the problem, $A(I, \epsilon)$ outputs a $(1 - \epsilon)$-approximate solution in time $|I|^{f(1/\epsilon)}$ for some
function $f$. A PTAS for a minimization problem finds a $(1 + \epsilon)$-approximate solution in time
$|I|^{f(1/\epsilon)}$. An approximation scheme where the exponent of $|I|$ in the running time is independent
of $\epsilon$ is called an efficient polynomial time approximation scheme (EPTAS). Formally, an EPTAS
is a PTAS whose running time is $f(1/\epsilon) \cdot |I|^{O(1)}$.

We give a brief introduction to parameterized complexity. A problem $\varphi$ is said to be fixed
parameter tractable with respect to parameter $k$ if there exists an algorithm that solves $\varphi$ in
$f(k) \cdot n^{O(1)}$ time, where $f$ is a function of $k$ that is independent of $n$ [12]. The class of all fixed
parameter tractable problems is denoted by FPT. The class W[1] of parameterized problems is
the basic class for fixed parameter intractability; FPT $\subseteq$ W[1], and the containment is believed
to be proper. A parameterized problem $\Pi$ with the property that an FPT algorithm for $\Pi$ would
imply that FPT=W[1] is called W[1]-hard. Downey and Fellows [12] define fpt-reductions, which
preserve W[1]-hardness.

Let $L, L' \subseteq \Sigma^* \times \mathbb{N}$ be two parameterized problems. We say that $L$ fpt-reduces to $L'$ if
there are functions $f, g : \mathbb{N} \to \mathbb{N}$ and an algorithm that, given an instance $(I, k)$, runs in time
$f(k)|I|^{O(1)}$ and outputs an instance $(I', k')$ such that $k' \le g(k)$ and $(I, k) \in L \iff (I', k') \in$

2 We leave out these details in this paper and direct interested readers to the work of Pevzner et al. [34].
[Figure 1 panels: 2. Fragmentation and amplification; 3. Sequencing into reads; 4. Error correction and fragment assembly. Inset plot: error probability of reads versus read position.]

Figure 1: A visualization of the basic steps needed for whole genome assembly of a biological
sample. The probability that a character in a read was sequenced incorrectly is highly dependent
on its position within the read. The change in the error probability with respect to the read
length is illustrated [13].



$L'$. These reductions work as expected: if $L$ fpt-reduces to $L'$ and $L'$ is FPT then so is $L$.
Furthermore, if $L$ fpt-reduces to $L'$ and $L$ is W[1]-hard then so is $L'$. We refer the reader to the
textbooks P21 |33l [H] for a more thorough discussion of parameterized complexity. 

Let $s$ be a string over the alphabet $\Sigma$. We denote the length of $s$ as $|s|$, and the $j$th character
of $s$ as $s[j]$. Hence, $s = s[1]s[2] \ldots s[|s|]$. For a set $S$ of strings of the same length we denote by
$S[i]$ the multiset $\{s[i] : s \in S\}$. That is, if the same character appears at position $i$ in several strings it is
counted several times in $S[i]$. For an interval $P = \{i, i + 1, \ldots, j\}$ of integers, define $s[P]$
to be the substring $s[i]s[i + 1] \ldots s[j]$ of $s$. For a set $S$ of strings and interval $P$ define $S[P]$ to be
the (multi)set $\{s[P] : s \in S\}$. For a set $S$ of length-$\ell$ strings the consensus string of $S$, denoted
as $c(S)$, is such that $c(S)[i]$ is the most-frequent character in $S[i]$ for all $i \le \ell$. Ties are broken
by selecting the lexicographically first such character; however, we note that the tie-breaking
will not affect our arguments.
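
The definition of $c(S)$ translates directly into code. The following is a minimal sketch with 0-based positions (the text uses 1-based positions); the function name `consensus` is an illustrative choice.

```python
from collections import Counter
from typing import Sequence


def consensus(S: Sequence[str]) -> str:
    """Consensus string c(S): column-wise most frequent character.

    Ties are broken by taking the lexicographically first character.
    """
    if not S:
        raise ValueError("S must be non-empty")
    length = len(S[0])
    if any(len(s) != length for s in S):
        raise ValueError("all strings must have the same length")
    out = []
    for i in range(length):
        counts = Counter(s[i] for s in S)        # the multiset S[i]
        top = max(counts.values())
        out.append(min(ch for ch, c in counts.items() if c == top))
    return "".join(out)
```

For `S = ["ACGT", "ACGA", "TCGA", "TTGT"]` the first and last columns are tied, so the lexicographically first character `A` is chosen in both and the result is `"ACGA"`.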

We denote the sum of Hamming distances between a string $s$ and a set of strings $S$ as $d(S, s)$.
Observe that the consensus string $c(S)$ minimizes $d(S, \cdot)$; that is, no other string $x$ is closer
to $S$ than $c(S)$. However, some $x \ne c(S)$ could achieve $d(S, x) = d(S, c(S))$ and we refer to such
strings as majority strings because they are obtained by picking a most-frequent character at
every position with ties broken arbitrarily. The Consensus String With Outliers problem can
now be succinctly stated as follows: given a set $S$ of strings and integers $k$ and $d$, the objective
is to find a subset $S^* \subseteq S$ of size $n^* = n - k$ such that $d(S^*, c(S^*)) \le d$, if it exists.

Given a subset $S^* \subseteq S$ we can compute $c(S^*)$ in polynomial time by choosing a majority
string for $c(S^*)$. If we are given $c(S^*)$ for the optimal solution $S^*$ (but not given $S^*$ itself)
then we can recover $S^*$ from $c(S^*)$ and $S$ in polynomial time since $S^*$ consists of the $n - k$ strings in
$S$ that are closest to $c(S^*)$. Similarly, given any string $x$, we denote by $S_x$ the subset of $S$
containing the $n^*$ strings closest to $x$. By construction $S_x$ satisfies the following inequality:
$d(S', x) \ge d(S_x, x) \ge d(S_x, c(S_x))$ for any subset $S'$ of $S$ of size $n^*$.
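
The set $S_x$ can be computed by sorting the input by distance to $x$. The sketch below is illustrative; the names `closest_subset` and `total_distance` are our own, and ties between equally distant strings are broken by input order, which is an assumption the text does not fix.

```python
from typing import List, Sequence


def hamming(x: str, y: str) -> int:
    """Hamming distance between two equal-length strings."""
    return sum(a != b for a, b in zip(x, y))


def total_distance(strings: Sequence[str], s: str) -> int:
    """d(S, s): sum of Hamming distances from the strings to s."""
    return sum(hamming(t, s) for t in strings)


def closest_subset(S: Sequence[str], x: str, n_star: int) -> List[str]:
    """S_x: the n_star strings of S closest to x (stable with respect to input order)."""
    return sorted(S, key=lambda s: hamming(s, x))[:n_star]
```

With `S = ["AAAA", "AAAT", "AATT", "TTTT", "CCCC"]`, `x = "AAAA"` and `n_star = 3`, the set $S_x$ is `["AAAA", "AAAT", "AATT"]` with $d(S_x, x) = 3$, and its consensus `"AAAT"` achieves the smaller value $2$, in line with the inequality above.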



2 Approximating Consensus String with Outliers 

We prove the existence of a PTAS for the Min-distance Consensus String with Outliers problem.
Our algorithm is based on random sampling. For a given value of $\epsilon$, the algorithm selects a
value for the parameter $r$ based on $\epsilon$, picks $r$ strings $S' = (s'_1, s'_2, \ldots, s'_r)$ from $S$ uniformly at
random (with replacement), and returns the consensus string corresponding to $S'$. The next
lemma shows that if $S'$ was taken from an (unknown) optimal solution $S^*$, rather than from the
entire input set $S$, then in expectation $c(S')$ is almost as good as the consensus string for the set
$S^*$.

Our arguments rely on well-known concentration bounds for sums of independent random 
variables. We use the following variant of Hoeffding's bound [23] given by Grimmett and
Stirzaker [20, p. 476].

Proposition 1. (Hoeffding's bound) Let $X_1, X_2, \ldots, X_n$ be independent random variables
such that $a_i \le X_i \le b_i$ for all $i$. Let $X = \sum_i X_i$ and let the expected value of $X$ be $E[X]$. Then it
follows that:

$$\Pr[X - E[X] \ge t] \le \exp\left(\frac{-2t^2}{\sum_{i=1}^{n} (b_i - a_i)^2}\right)$$

Lemma 1. For all $\epsilon > 0$ and $\sigma$, there exists a value of $r$ such that the following holds: if $S$ is
a set of length-$\ell$ strings over the alphabet $\Sigma$, where $|\Sigma| = \sigma$, and $S'$ is a subset of $S$ of size $r$,
$(s'_1, s'_2, \ldots, s'_r)$, chosen uniformly at random, then $E[d(S, c(S'))] \le (1 + \epsilon)d(S, c(S))$.

Proof. We prove that there exists an $r$ such that $E[d(S, c(S'))] \le (1 + 2\epsilon)d(S, c(S))$. Applying
this weaker inequality with $\epsilon' = \epsilon/2$ then proves the statement of the Lemma. We assume,
without loss of generality, that $c(S)$ is equal to $0^\ell$, $\epsilon \le 1/16$, and $r \ge 8$. We restrict interest to
column $i$ of $S$, where $1 \le i \le \ell$; let $d_i$ be the number of nonzero symbols in column $i$ and let
$z_i = n - d_i$. Observe that $d(S, c(S'))$ is equal to the sum over $i$ of the number of strings $s \in S$
such that $s[i] \ne c(S')[i]$. By linearity of expectation it is sufficient to prove that for every $i$ we
have $E[d(S[i], c(S')[i])] \le (1 + 2\epsilon)d_i$.

First, we assume $d_i$ is at most $\epsilon n$. Let $q$ be the probability that $c(S')[i] \ne 0$. It follows that
$E[d(S[i], c(S')[i])]$ is at most $d_i(1 - q) + qn$. We determine an upper bound on the probability
$q$ as follows:

$$q \le \sum_{x=\lceil r/2 \rceil}^{r} \binom{r}{x} (d_i/n)^x (1 - d_i/n)^{r-x} \le 2^r \sum_{x=\lceil r/2 \rceil}^{r} (d_i/n)^x \le 2^r \frac{(d_i/n)^{\lceil r/2 \rceil}}{1 - (d_i/n)}$$

Since $d_i/n \le \epsilon \le 1/16$, we get:

$$q \le 2^{r+1} (d_i/n)^{\lceil r/2 \rceil} \le 2^{r+1} \epsilon^{\lceil r/4 \rceil} (d_i/n)^{\lfloor r/4 \rfloor} \le \frac{2^{r+1}}{2^r} (d_i/n)^{\lfloor r/4 \rfloor} = 2 (d_i/n)^{\lfloor r/4 \rfloor}$$

It follows from the last inequality, and that $r \ge 8$, that $q \le 2(d_i/n)^2$. Hence, we obtain the
following bound on $E[d(S[i], c(S')[i])]$:

$$E[d(S[i], c(S')[i])] \le d_i(1 - q) + qn \le d_i + 2\left(\frac{d_i}{n}\right)^2 n \le (1 + 2\epsilon)d_i$$

Next, we assume that $d_i > \epsilon n$. We say that a symbol $a \in \Sigma$ is a good symbol if there
are at least $z_i - n\epsilon^2$ strings in $S$ that have the symbol $a$ at column $i$; any symbol that is not
good is bad. If $c(S')[i]$ is a good symbol then $d(S[i], c(S')[i])$ is at most $d_i + n\epsilon^2$ and hence is
at most $(1 + \epsilon)d_i$ since $d_i > \epsilon n$. Let $p$ be the probability that $c(S')[i]$ is a bad symbol; then
$E[d(S[i], c(S')[i])]$ is upper bounded by $(1 - p)(1 + \epsilon)d_i + pn$. Lastly, we determine an upper
bound on $p$ to complete the proof.

Let $a$ be a bad symbol and $p_a$ be the probability that $c(S')[i]$ is equal to $a$. We note that
in order for $c(S')[i]$ to be $a$, there have to be at least as many positions equal to $a$ as positions
equal to $0$ in $S'[i]$. Let $X$ be the difference between the number of positions equal to $a$ and the
number of positions equal to $0$ in $S'[i]$. It follows that $p_a \le \Pr[X \ge 0]$. Let $X_j$ be a random variable which
is $1$ if $s'_j[i]$ is equal to $a$, $-1$ if it is equal to $0$, and $0$ otherwise, so that $X = \sum_{j=1}^{r} X_j$. Since $a$ is a bad symbol,
there are at least $\epsilon^2 n$ more positions equal to $0$ than positions equal to $a$ in $S[i]$ and therefore
$E[X_j] = \Pr[s'_j[i] = a] - \Pr[s'_j[i] = 0] \le -\epsilon^2$. By linearity of expectation, we obtain
$E[X] = \sum_{j=1}^{r} E[X_j] \le -r\epsilon^2$. Using this inequality, we get $\Pr[X \ge 0] \le \Pr[X - E[X] \ge r\epsilon^2]$. Since the
$X_j$ variables are independent and the difference between the upper and lower bound of $X_j$ is $2$, we
can use Hoeffding's inequality to obtain the following bound.

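$$\Pr[X - E[X] \ge r\epsilon^2] \le \exp\left(\frac{-2r^2\epsilon^4}{r \cdot 2^2}\right) = \exp\left(\frac{-r\epsilon^4}{2}\right)$$

By choosing $r = \max\left(\left\lceil \frac{2}{\epsilon^4} \ln \frac{\sigma}{\epsilon^2} \right\rceil, 8\right)$, we get $p_a \le \frac{\epsilon^2}{\sigma}$. Finally, we bound $p$ as follows:
$p \le \sum_a p_a \le \sigma \cdot \frac{\epsilon^2}{\sigma} = \epsilon^2$. We can now use the upper bound on $p$ and our assumption that $d_i > \epsilon n$ to bound
$E[d(S[i], c(S')[i])]$:

$$E[d(S[i], c(S')[i])] \le (1 - p)(1 + \epsilon)d_i + pn \le (1 + \epsilon)d_i + \epsilon^2 n \le (1 + 2\epsilon)d_i.$$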
This concludes the proof. □ 
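
To get a feeling for the sample size the proof asks for, the sketch below solves $\exp(-r\epsilon^4/2) \le \epsilon^2/\sigma$ for $r$ and also enforces $r \ge 8$. The closed form is derived here from the Hoeffding step under the proof's assumption $\epsilon \le 1/16$ (larger values are clamped); it is a sufficient choice, not necessarily the smallest one, and the function name `sample_size` is hypothetical.

```python
import math


def sample_size(eps: float, sigma: int) -> int:
    """A sufficient sample size r for a given eps and alphabet size sigma.

    Ensures exp(-r * eps^4 / 2) <= eps^2 / sigma and r >= 8.
    """
    if eps <= 0 or sigma < 1:
        raise ValueError("need eps > 0 and sigma >= 1")
    eps = min(eps, 1 / 16)   # the proof assumes eps <= 1/16 without loss of generality
    r = math.ceil((2 / eps ** 4) * math.log(sigma / eps ** 2))
    return max(r, 8)
```

For the DNA alphabet ($\sigma = 4$) and $\epsilon = 1/16$ this already gives a sample size in the hundreds of thousands, which illustrates why the bound is best read as a worst-case guarantee rather than as a practical parameter setting.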

Lemma 1 gives a simple, deterministic PTAS for Min-distance Consensus String with Outliers.

Theorem 1. There exists a PTAS for Min-distance Consensus String with Outliers. 

Proof. It follows from Lemma 1 that there exists an integer $r$ such that
$E[d(S^*, c(S'))] \le (1 + \epsilon)d(S^*, c(S^*))$ if $S'$, the set of $r$ strings chosen from $S$, is drawn from an (unknown) optimal
solution $S^*$. Some subset $S'$ of $S^*$ must achieve the expectation. The algorithm guesses this set $S'$ by
trying all possible $n^r$ subsets of $S$ of size $r$. Let $x = c(S')$. For each guess, the algorithm computes the set $S_x$ of the
$n^*$ strings closest to $x$, and it returns the best such set. This set satisfies
$d(S_x, c(S_x)) \le d(S_x, x) \le d(S^*, x) \le (1 + \epsilon)d(S^*, c(S^*))$,
concluding the proof. □
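
The enumeration in the proof can be sketched as follows. This is a minimal illustration that tries all $n^r$ ordered $r$-tuples (with repetition) and keeps the cheapest resulting set; all function names are our own, and the value of `r` is passed in by the caller rather than derived from $\epsilon$.

```python
from collections import Counter
from itertools import product
from typing import List, Sequence, Tuple


def hamming(x: str, y: str) -> int:
    return sum(a != b for a, b in zip(x, y))


def total_distance(strings: Sequence[str], s: str) -> int:
    return sum(hamming(t, s) for t in strings)


def consensus(S: Sequence[str]) -> str:
    """Column-wise most frequent character, ties broken lexicographically."""
    out = []
    for i in range(len(S[0])):
        counts = Counter(s[i] for s in S)
        top = max(counts.values())
        out.append(min(ch for ch, c in counts.items() if c == top))
    return "".join(out)


def closest_subset(S: Sequence[str], x: str, n_star: int) -> List[str]:
    return sorted(S, key=lambda s: hamming(s, x))[:n_star]


def enumerate_samples(S: Sequence[str], k: int, r: int) -> Tuple[int, str, List[str]]:
    """Try every r-tuple of input strings; return (cost, consensus, subset) of the best S_x."""
    n_star = len(S) - k
    best = None
    for sample in product(S, repeat=r):          # n^r candidate samples
        x = consensus(sample)
        S_x = closest_subset(S, x, n_star)
        c = consensus(S_x)
        cost = total_distance(S_x, c)
        if best is None or cost < best[0]:
            best = (cost, c, S_x)
    return best
```

On `S = ["AAAA", "AAAT", "AATA", "TTTT", "CCCC"]` with `k = 2` and `r = 2`, the sketch discards the two outliers and returns cost 2 with consensus `"AAAA"`.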

If the number of outliers $k$ is small compared to $n$, i.e., $k \le n/2$, then with probability at least $1/2^r$
a random subset $S'$ of $r$ strings is a subset of an optimal solution $S^*$. We use this to give a
randomized EPTAS for Min-distance Consensus String with Outliers. 

Theorem 2. There exists a randomized EPTAS for Min-distance Consensus String with Outliers
for inputs where $k < cn$ for $c < 1$. The algorithm runs in time $f(1/\epsilon)(n\ell)^{O(1)}$ and
outputs a $(1 + \epsilon)$-approximate solution with probability at least $1/2$.

Proof. We give a polynomial-time algorithm that returns a $(1 + \epsilon)$-approximate solution with
probability $(1 - c)^r \cdot f(\epsilon)$. Repeating this algorithm $O\left(\frac{1}{(1 - c)^r \cdot f(\epsilon)}\right)$ times then yields the statement
of the theorem. The algorithm selects a value for $r$ such that for a random subset $S'$ of
the unknown optimal solution $S^*$ the inequality $E[d(S^*, c(S'))] \le (1 + \frac{\epsilon}{3})d(S^*, c(S^*))$ holds. It
follows from Lemma 1 that this can be done so that $r$ only depends on $\epsilon$. Next, $r$ strings from
$S$ are selected uniformly at random (with replacement) to form a subset $S'$. Let $x = c(S')$. The
algorithm then returns the set $S_x$ of the $n^*$ strings closest to $x$.

It remains to find a sufficient lower bound on the probability that the returned set is a $(1 + \epsilon)$-approximation.
Since $k < cn$, it follows that the probability that $S'$ is taken from an (unknown)
optimal solution $S^*$ is at least $\left(\frac{n - cn}{n}\right)^r = (1 - c)^r$. If $S'$ is taken from $S^*$ then by Lemma 1
we have that $E[d(S^*, c(S'))] \le (1 + \frac{\epsilon}{3})d(S^*, c(S^*))$. Next, we condition on this event. By Markov's
inequality [20, p. 311] the probability that $d(S^*, c(S'))$ exceeds its expectation by a factor of at least
$1 + \frac{\epsilon}{3}$ is at most $\frac{1}{1 + \epsilon/3}$. Hence, with probability $f(\epsilon)$ for some function $f$ of $\epsilon$ we have that:

$$d(S^*, c(S')) \le \left(1 + \frac{\epsilon}{3}\right) d(S^*, c(S^*)) \cdot \left(1 + \frac{\epsilon}{3}\right),$$

which is at most $(1 + \epsilon)d(S^*, c(S^*))$ when $2\left(\frac{\epsilon}{3}\right)^2 \le \frac{\epsilon}{3}$. In particular, this holds if $\epsilon \le 1/3$,
concluding the proof. □
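
The randomized procedure from the proof, including the repetition step, can be sketched as below. The sample size `r` and the number of `trials` are left to the caller, since the values required by the analysis are worst-case bounds; the function names and the `seed` parameter are illustrative assumptions.

```python
import random
from collections import Counter
from typing import List, Sequence, Tuple


def hamming(x: str, y: str) -> int:
    return sum(a != b for a, b in zip(x, y))


def total_distance(strings: Sequence[str], s: str) -> int:
    return sum(hamming(t, s) for t in strings)


def consensus(S: Sequence[str]) -> str:
    """Column-wise most frequent character, ties broken lexicographically."""
    out = []
    for i in range(len(S[0])):
        counts = Counter(s[i] for s in S)
        top = max(counts.values())
        out.append(min(ch for ch, c in counts.items() if c == top))
    return "".join(out)


def closest_subset(S: Sequence[str], x: str, n_star: int) -> List[str]:
    return sorted(S, key=lambda s: hamming(s, x))[:n_star]


def sample_and_select(S: Sequence[str], k: int, r: int, trials: int,
                      seed: int = 0) -> Tuple[int, List[str]]:
    """Repeat: sample r strings with replacement, take x = c(S'), keep the best S_x."""
    rng = random.Random(seed)
    n_star = len(S) - k
    best = None
    for _ in range(trials):
        sample = [rng.choice(S) for _ in range(r)]
        x = consensus(sample)
        S_x = closest_subset(S, x, n_star)
        cost = total_distance(S_x, consensus(S_x))
        if best is None or cost < best[0]:
            best = (cost, S_x)
    return best
```

Each trial costs time polynomial in $n$ and $\ell$, so the total running time is governed by the number of trials, which in the analysis depends only on $\epsilon$ and $c$.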

The best way to view Theorem 2 is as a performance guarantee on a natural heuristic for
the problem when the parameter $r$ is chosen appropriately. We note that one would expect
natural inputs to contain substantially fewer outliers than $n/2$, and that Markov's inequality is
a very pessimistic bound for the probability of achieving expectation. Hence, it is likely that
for reasonable inputs the above algorithm will perform much better in practice than the proved
bounds.

We now show that the PTAS for the Min-distance Consensus String with Outliers problem 
can be extended to obtain a PTAS for the Consensus String with Max Non-Outliers problem. 
We recall that for this optimization problem, we are given $S$ and an integer $d$ and asked to find
a set $S^* \subseteq S$ that maximizes $|S^*|$ and satisfies the constraint $d(S^*, c(S^*)) \le d$.

Theorem 3. There exists a PTAS for Consensus String with Max Non-Outliers. 

Proof. We give a $(1 - 2\epsilon)$-approximation algorithm that runs in $O((n\ell)^{f(\epsilon)})$ time. We denote
an (unknown) optimal solution as $S^*$, and let $n^* = |S^*|$. A subset $S' \subseteq S$ is said to be feasible
if $d(S', c(S')) \le d$. First, the algorithm enumerates all subsets of $S$ of size at most $1/\epsilon$ and
keeps the largest feasible set. Next, the algorithm guesses $n^*$ (by trying all possibilities) and
applies the algorithm from Theorem 1 to find a set $S_x$ of size $n^*$ and a string $x$ such that
$d(S_x, x) \le (1 + \epsilon)d(S^*, c(S^*)) \le (1 + \epsilon)d$. It then constructs $S''$ by removing the $\lceil \epsilon n^* \rceil$ strings
furthest away from $x$ from $S_x$. Since

$$d(S'', c(S'')) \le d(S'', x) \le (1 - \epsilon)d(S_x, x) \le (1 - \epsilon)(1 + \epsilon)d \le d,$$

it follows that $S''$ is feasible. The algorithm returns either $S''$ or the largest feasible set found
in the first phase, whichever is larger. The running time is clearly bounded by $O((n\ell)^{f(\epsilon)})$,
so it remains to prove that the returned set is in fact a $(1 - 2\epsilon)$-approximation of $S^*$. If $n^* \le \frac{1}{\epsilon}$
the algorithm finds and returns $S^*$. Next, if $n^* > \frac{1}{\epsilon}$ it follows that
$|S''| \ge |S_x| - \lceil \epsilon n^* \rceil \ge n^*(1 - \epsilon) - 1 \ge n^*(1 - 2\epsilon)$, concluding the proof. □
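
The trimming step used in the proof, which turns $S_x$ into the feasible set $S''$, can be sketched as follows. The function name `trim_furthest` is hypothetical, and ties among equally distant strings are broken by input order, which is an assumption of this sketch.

```python
import math
from typing import List, Sequence


def hamming(x: str, y: str) -> int:
    return sum(a != b for a, b in zip(x, y))


def total_distance(strings: Sequence[str], s: str) -> int:
    return sum(hamming(t, s) for t in strings)


def trim_furthest(S_x: Sequence[str], x: str, eps: float) -> List[str]:
    """Remove the ceil(eps * |S_x|) strings of S_x that are furthest from x."""
    remove = math.ceil(eps * len(S_x))
    ordered = sorted(S_x, key=lambda s: hamming(s, x))   # closest first
    return ordered[:len(ordered) - remove]
```

Because the removed strings are at least as far from $x$ as the average string of $S_x$, the total distance drops by at least a factor of $(1 - \epsilon)$, which is exactly the inequality $d(S'', x) \le (1 - \epsilon)d(S_x, x)$ used above.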

3 Hardness Results 

For reasonable instances of the Min-distance Consensus String With Outliers problem, we expect
the number of non-outliers to be greater than the number of outliers. As we have seen,
Theorem 2 gives an EPTAS for most instances of the Min-distance Consensus String With Outliers
problem, namely those where $k < cn$ for $c < 1$. When $k$ cannot be upper bounded in this
manner then the noise is much stronger than the signal, and there is little hope for accurate
error correction. Further, Min-distance Consensus String with Outliers should not be seen as
an error correction problem when $k$ is almost equal to $n$, but rather as the problem of finding the
"densest possible" cluster of points in Hamming space. Whether the requirement
that $k < cn$ for $c < 1$ is necessary in order to improve the PTAS from Theorem 1 to an EPTAS
is thus a natural question. We now prove that this requirement is unavoidable, since the general
version of Min-distance Consensus String with Outliers does not admit an EPTAS unless
FPT=W[1].

Theorem 4. There exists no EPTAS for Min-distance Consensus String With Outliers, unless
FPT = W[1].

The proof of Theorem 4 is by reduction from the MultiColored Clique (MCC) problem. Here the
input is a graph $G$, an integer $k$ and a partition of $V(G)$ into $V_1 \uplus V_2 \uplus \dots \uplus V_k$ such that for each $i$,
$G[V_i]$ is an independent set. The task is to determine whether $G$ contains a clique $C$ of size $k$.
Observe that such a clique must contain exactly one vertex from each $V_i$, since for each $i$ we have
$|C \cap V_i| \le 1$. It is known that MCC cannot be solved in time $f(k) n^{O(1)}$, unless FPT = W[1] [To].

Given an instance $(G, k)$ of MCC we produce in time $f(k) n^{O(1)}$ an instance $(S, n^*)$ of Min-distance
Consensus String with Outliers with the following property. If $G$ has a $k$-clique then
there exists an $S' \subseteq S$ of size $n^*$ such that $d(S', c(S')) \le D_{yes}$, whereas if no $k$-clique exists in
$G$ then for each $S' \subseteq S$ of size $n^*$ we have $d(S', c(S')) \ge D_{no}$. The values of $D_{yes}$ and $D_{no}$ will
be chosen later in the proof, but the crux of the construction is that $D_{no} \ge (1 + \frac{1}{h(k)}) D_{yes}$ for a
function $h$ of $k$ only. Hence, one could use the reduction together with an EPTAS for Min-distance
Consensus String with Outliers, setting $\epsilon = \frac{1}{2h(k)}$, to solve the MCC problem in time $g(k) n^{O(1)}$.
This reduction is a parameterized, gap-creating reduction where the size of the gap decreases as $k$
increases, but the decrease is a function of $k$ only.



Construction. We describe how the instance (S, n*) is constructed from (G,k). Our con- 
struction is randomized, and will succeed with probability |. To prove Theorem 4 we have to 
change the construction to make it deterministic, but for now let us not worry about that. We
start by considering the instance $(G, k)$ and let $E(G) = \{e_1, \dots, e_m\}$. We partition the edge
set $E(G)$ into sets $E_{p,q}$ where $1 \le p < q \le k$ as follows: $e_j \in E_{p,q}$ if $e_j = uv$, $u \in V_p$ and $v \in V_q$.

Edges of $G$ are unordered pairs $uv$ of vertices of $G$. An edge endpoint $\vec{e}$ is an ordered pair
$(u, v)$ of vertices of $G$ such that $uv$ is an edge of $G$. We denote the set of all edge endpoints of
$G$ by $\vec{E}(G) = \{\vec{e}_1, \vec{e}_2, \dots, \vec{e}_{2m}\}$. There are two edge endpoints that correspond to the same edge.
For two edge endpoints $\vec{e}_p$ and $\vec{e}_q$ that both correspond to the edge $e_r$ we say that $\vec{e}_p \sim \vec{e}_q$,
$\vec{e}_p \sim e_r$ and $\vec{e}_q \sim e_r$. For every $i \le k$ define the set $\vec{E}_i = \{(u, v) \in \vec{E}(G) : u \in V_i\}$.

Based on $G$ and $k$, we select two integers $\ell_1$ and $\ell_2$ that satisfy the following properties:
$\ell_1 = f \cdot \log n$ and $\ell_2 = g \cdot \ell_1$ for some $f \ge 1$ and $g \ge 1$ that depend only on $k$. The exact values of $\ell_1$
and $\ell_2$ will be discussed later in the proof. We construct a set $Z = \{z_1, z_2, \dots, z_{2m}\}$ of strings; $Z$
will act as a "pool of random bits" in our construction. For each endpoint $\vec{e}_i$ in the set $\vec{E}(G)$ of
edge endpoints we make a string $z_i$ as follows.

\[ z_i = a_i^1 \circ a_i^2 \circ \dots \circ a_i^k \circ b_i^{1,2} \circ b_i^{1,3} \circ \dots \circ b_i^{1,k} \circ b_i^{2,3} \circ b_i^{2,4} \circ \dots \circ b_i^{k-1,k} \]

For every $p$, $a_i^p$ is a random binary string of length $\ell_1$. For every $p$ and $q$, $b_i^{p,q}$ is a random
binary string of length $\ell_2$. For each $p$ and vertex $u \in V_p$ we make an identification string $id(u)$
of length $\ell_1$. Let $i$ be the smallest integer such that the edge endpoint $\vec{e}_i$ is $(u, v)$ for some $v$.
We set $id(u) = a_i^p$. Similarly, for every pair of integers $p < q$ and each edge $e \in E_{p,q}$ we make an
identification string $id(e)$ of length $\ell_2$. Let $i$ be the smallest integer such that $\vec{e}_i \sim e$. We set
$id(e) = b_i^{p,q}$. We now make the set $S$ of strings in our instance. For each endpoint $\vec{e}_i$
we make a string $s_i$ as follows.

\[ s_i = \hat{a}_i^1 \circ \hat{a}_i^2 \circ \dots \circ \hat{a}_i^k \circ \hat{b}_i^{1,2} \circ \hat{b}_i^{1,3} \circ \dots \circ \hat{b}_i^{1,k} \circ \hat{b}_i^{2,3} \circ \hat{b}_i^{2,4} \circ \dots \circ \hat{b}_i^{k-1,k} \]

Here $\hat{a}_i^p = id(u)$ if $\vec{e}_i = (u, v)$ with $u \in V_p$, and $\hat{a}_i^p = a_i^p$ otherwise. Also, $\hat{b}_i^{p,q} = id(uv)$ if $\vec{e}_i \sim uv$, $u \in V_p$
and $v \in V_q$. Otherwise $\hat{b}_i^{p,q} = b_i^{p,q}$. We refer to $\hat{a}_i^1$ through $\hat{a}_i^k$ as the vertex blocks of $s_i$, and the
$\hat{b}_i^{p,q}$'s are the edge blocks of $s_i$. We refer to $\hat{a}_i^p$ as the $p$'th vertex block and to $\hat{b}_i^{p,q}$ as
the $(p,q)$'th edge block. We set $n^* = 2\binom{k}{2}$, $L = k \cdot \ell_1 + \binom{k}{2} \cdot \ell_2$, and $N = |S| = 2m$; this concludes
the construction. Recall that $n^*$ is the size of the solution $S^*$ sought for and observe that $L$ is
the length of the constructed strings in $S$.

We consider the constructed strings $s_i$ as random variables, and for every $j$ the character $s_i[j]$
is also a random variable which takes value 1 with probability 1/2 and 0 with probability 1/2.
Observe that for $j \ne j'$ and any $i$ and $i'$ the random variables $s_i[j]$ and $s_{i'}[j']$ are independent.

On the other hand $s_i[j]$ and $s_{i'}[j]$ could be dependent. However, if $s_i[j]$ and $s_{i'}[j]$ are dependent
then, by construction, $s_i[j] = s_{i'}[j]$. Let $S^* \subseteq S$ such that $|S^*| = n^*$. Here we consider $S^*$ as a set
of random string variables, rather than a set of strings. We are interested in studying $d(S^*, c(S^*))$
for different choices of the set $S^*$. We can write out $d(S^*, c(S^*))$ as $\sum_{p=1}^{L} d(S^*[p], c(S^*)[p])$ and
so $d(S^*, c(S^*))$ is the sum of $L$ independent random variables, each taking values from 0 to $n^*$.
Thus, when $L$ is large enough $d(S^*, c(S^*))$ is sharply concentrated around $E[d(S^*, c(S^*))]$.

We turn our attention to $E[d(S^*, c(S^*))]$ for different choices of $S^*$. The two main cases that
we distinguish between are whether $S^*$ corresponds to the set of edge endpoints of a clique in $G$
or not. Before proceeding to these cases, we need some additional definitions. Let $v$ be a vector
of positive integers. We define the random variable $X_v = W \cdot v$ where $W$ is a random vector
with the same dimension as $v$, such that each coordinate of $W$ is drawn from $\{-1, 1\}$ uniformly at
random. The variable $X_v$ is interpreted as follows: start a one-dimensional random walk at 0;
in each step of the walk we go left or right with probability 1/2. However, the length of the
different steps varies: in step $i$ the walk jumps $v[i]$ to the left or right. The value of $X_v$ is the
offset from the origin at the end of the walk. The total length of the random walk is $\sum_i v[i]$
whereas the number of steps of the walk is the dimension of $v$.
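
To make the random-walk variable concrete, the following sketch computes the exact distribution of the walk and the expected distance from the origin by dynamic programming over positions. The function names and the optional `start` parameter are illustrative choices, not notation from the paper.

```python
from collections import defaultdict
from fractions import Fraction


def walk_distribution(v, start=0):
    """Exact distribution of start + sum_i W[i] * v[i], W[i] uniform in {-1, +1}."""
    dist = {start: Fraction(1)}
    for step in v:
        nxt = defaultdict(Fraction)
        for pos, pr in dist.items():
            nxt[pos - step] += pr / 2  # jump left by `step`
            nxt[pos + step] += pr / 2  # jump right by `step`
        dist = dict(nxt)
    return dist


def expected_abs_offset(v, start=0):
    """E[|start + X_v|], the expected distance from the origin after the walk."""
    return sum(abs(pos) * pr for pos, pr in walk_distribution(v, start).items())
```

For instance, four unit steps give an expected offset of 3/2, while replacing two of the unit steps by one double step (`[2, 1, 1]`) raises it to 2.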

Let $j$ be a position in an edge block. What we mean by this is that $s_i[j]$ is a character
in $\hat{b}_i^{p,q}$. Suppose no two strings of $S^*$ correspond to edge endpoints of the same edge. Then
$d(S^*[j], c(S^*[j]))$ is distributed as $n^*/2 - |X_v|$ where $v$ is an $n^*$-dimensional vector of 1s. Specifically,
for all $s_i \in S^*$ the $s_i[j]$'s are independent, so $c(S^*[j])$ is the majority character out of $n^*$
characters independently drawn from $\{0, 1\}$, and $d(S^*[j], c(S^*[j]))$ is the number of occurrences
of the minority character. This is distributed as $n^*/2 - |X_v|$.

Again, let $j$ be a position in the $(p,q)$-edge block, but now suppose that $S^*$ contains $t$
pairs of edge endpoints that correspond to the same edge in $E_{p,q}$. $S^*$ can also contain single
endpoints of edges from $E_{p,q}$ or both endpoints of edges in $E_{p',q'}$ for $(p',q') \ne (p,q)$ but we do
not count these. From the construction of the $(p,q)$-edge block it follows that $d(S^*[j], c(S^*[j]))$
is distributed as $n^*/2 - |X_v|$ where $v$ is an $(n^*-t)$-dimensional vector with $t$ entries of value 2 and
$n^* - 2t$ entries of value 1. We define the random variable $X^i_{r,t} = i + X_v$ where $v$ is a vector
with $r - 2t$ entries that are 1 and $t$ entries that are 2. Intuitively $X^i_{r,t}$ is the offset from 0 of a
random walk starting at $i$ of length $r$, with $t$ steps of length 2 and the remaining steps of length
1. We set $x^i_{r,t} = E[|X^i_{r,t}|]$. Finally, we define $E_{yes}$ as

\[ E_{yes} = k \cdot \ell_1 \cdot (n^*/2 - x^{k-1}_{n^*-k+1,0}) + \binom{k}{2} \cdot \ell_2 \cdot (n^*/2 - x^0_{n^*,1}). \quad (1) \]

Lemma 2. Let $S^*$ be a subset of $S$ of size $n^*$ that corresponds to the set of edge endpoints of a
$k$-clique in $G$. Then $E[d(S^*, c(S^*))] = E_{yes}$.

Proof. For each position $j$ in a vertex block, consider the distribution of $d(S^*[j], c(S^*)[j])$.
There are $k-1$ edge endpoints in $S^*$ which are all incident to the same vertex, so the strings
corresponding to these endpoints all have the same character at position $j$. The remaining
strings all have random characters at this position. Hence $d(S^*[j], c(S^*)[j])$ is distributed as
$n^*/2 - |X_v|$ where $v$ is an $(n^*-(k-2))$-dimensional vector with $n^*-(k-1)$ entries of value 1 and
one entry with value $k-1$. It is easy to see that $|X_v|$ is in fact distributed as $|X^{k-1}_{n^*-k+1,0}|$ since
we can make the step corresponding to the entry of value $k-1$ first, and this step will take the
random walk to position $k-1$ or $-(k-1)$, but with respect to distance from 0 these positions
are symmetric. Since there are $k \cdot \ell_1$ positions in vertex blocks this accounts for the first term
of the equation.

For each position $j$ in an edge block $(p,q)$ there are two strings in $S^*$ that correspond to edge
endpoints of the same edge in $E_{p,q}$. These two strings have the same character at position $j$. All
the other strings in $S^*$ correspond to edge endpoints of edges in $E_{p',q'}$ where $p' \ne p$ or $q' \ne q$.
The characters at position $j$ for these strings are drawn independently. Hence $d(S^*[j], c(S^*)[j])$
is distributed as $n^*/2 - |X^0_{n^*,1}|$. Since there are $\binom{k}{2} \cdot \ell_2$ positions in edge blocks this accounts
for the second term of the equation. □

We now proceed to show that for any set $S^*$ that does not correspond to a set of edge
endpoints of a $k$-clique in $G$, $E[d(S^*, c(S^*))]$ is at least a factor $\epsilon$ greater than $E_{yes}$, where $\epsilon$
depends only on $k$. Let $\vec{E}^*$ be the set of edge endpoints corresponding to $S^*$. Define $E^*$ to be
the set of edges $uv \in E(G)$ such that $(u,v) \in \vec{E}^*$ and $(v,u) \in \vec{E}^*$. Clearly, $|E^*| \le \binom{k}{2}$, hence
if $E_{p,q} \cap E^* \ne \emptyset$ for every $p, q$ then $|E_{p,q} \cap E^*| = 1$ for every $p, q$. We start by proving that if
there exists a $p, q$ such that $E_{p,q} \cap E^* = \emptyset$ then $E[d(S^*, c(S^*))]$ is big. This proof is based on
"differentiating" $x^0_{n^*,t}$ with respect to $t$. In particular for integers $i, r, t$ such that $t \ge 1$ and
$r \ge 2t$ define $\delta x^i_{r,t} = x^i_{r,t} - x^i_{r,t-1}$.

Claim 1. $x^0_{n^*,0} < x^0_{n^*,1}$. If $n^*$ is divisible by 4 then $\delta x^0_{n^*,1} > \delta x^0_{n^*,t}$ for all $t > 1$. Furthermore,
for every $i$, $t$ and $r$ we can compute $x^i_{r,t}$ in time polynomial in $i$ and $r$.

The intuition of Claim 1 is as follows. A random walk with double steps is just the sum of
independent random variables, with variables corresponding to single steps taking values from
$\{-1, 1\}$ and variables corresponding to double steps taking values from $\{-2, 2\}$. A double step
has higher variance than the sum of two single steps. Hence, if we do a random walk starting
from 0 of total length $n^*$ with $t$ double steps, then the expected distance from 0 should increase
as $t$ increases. Furthermore, as $t$ increases the variance of the random walk increases linearly, so
the standard deviation increases less and less with each increment of $t$. Thus it is natural to
expect that as $t$ increases, each successive step increases the expected offset from 0 less and less.
Quite surprisingly this does not hold in general (we do not prove this, as it is not important for
our results). However, when the length of the random walk is a multiple of 4, the claim does
hold.

Proof of Claim 1. Recall that $x^i_{r,t} = E[|X^i_{r,t}|]$ where $X^i_{r,t}$ is a random variable denoting the final
position of a random walk of length $r$, with $t$ double steps, starting at $i$. Here $i$ is an integer and
might be negative. Conditional expectation yields the following recurrence for $x^i_{r,t}$, $r \ge 2t \ge 0$.

\[ x^i_{r,t} = \begin{cases} |i| & \text{if } r = 0, \\ (x^{i+1}_{r-1,t} + x^{i-1}_{r-1,t})/2 & \text{if } r > 2t, \\ (x^{i+2}_{r-2,t-1} + x^{i-2}_{r-2,t-1})/2 & \text{if } t \ge 1. \end{cases} \]

It is easy to see that one of the three cases must apply when $r \ge 2t \ge 0$, and $x^i_{r,t}$ is only defined
for these values. Observe that if $r > 2t$ and $t \ge 1$ then both the second and the third case apply.
The recurrence above also yields a polynomial time algorithm to compute $x^i_{r,t}$. The recurrence
above together with the definition of $\delta x^i_{r,t}$ yields the following recurrence for $\delta x^i_{r,t}$, for $r \ge 2t$ and
$t \ge 1$.

\[ \delta x^i_{r,t} = \begin{cases} 0 & \text{if } r = 2,\ |i| \ge 2, \\ 1/2 & \text{if } r = 2,\ |i| = 1, \\ 1 & \text{if } r = 2,\ |i| = 0, \\ (\delta x^{i+1}_{r-1,t} + \delta x^{i-1}_{r-1,t})/2 & \text{if } r > 2t, \\ (\delta x^{i+2}_{r-2,t-1} + \delta x^{i-2}_{r-2,t-1})/2 & \text{if } t \ge 2. \end{cases} \]
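
The recurrence for the expected offset translates directly into a memoized procedure. The sketch below is one possible rendering, with hypothetical function names `x` and `delta_x`, and exact rational arithmetic so that the base cases of the difference recurrence (values 0, 1/2 and 1 at $r = 2$) can be checked without rounding.

```python
from fractions import Fraction
from functools import lru_cache


@lru_cache(maxsize=None)
def x(i, r, t):
    """Expected |final position| of a walk of length r with t double steps, started at i."""
    assert r >= 2 * t >= 0, "only defined for r >= 2t >= 0"
    if r == 0:
        return Fraction(abs(i))
    if r > 2 * t:
        # At least one unit step remains: condition on its direction.
        return (x(i + 1, r - 1, t) + x(i - 1, r - 1, t)) / 2
    # Here r == 2t and t >= 1: only double steps remain.
    return (x(i + 2, r - 2, t - 1) + x(i - 2, r - 2, t - 1)) / 2


def delta_x(i, r, t):
    """Increase in expected offset when one more pair of unit steps becomes a double step."""
    return x(i, r, t) - x(i, r, t - 1)
```

Since each call touches only states with smaller $r$ and positions within distance $r$ of $i$, memoization keeps the number of evaluated states polynomial in $|i|$ and $r$.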



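A straightforward induction using this recurrence shows that $\delta x^0_{r,1} > 0$ for all $r \ge 2$, proving
that $x^0_{n^*,0} < x^0_{n^*,1}$. Define $\delta^2 x^i_{r,t} = \delta x^i_{r,t} - \delta x^i_{r,t-1}$. Observe that $\delta^2 x^i_{r,t}$ is only well defined
when $r \ge 2t$ and $t \ge 2$. Inserting the recurrence for $\delta x^i_{r,t}$ into the definition of $\delta^2 x^i_{r,t}$ yields the
following recurrence for $\delta^2 x^i_{r,t}$.

\[ \delta^2 x^i_{r,t} = \begin{cases} 0 & \text{if } r = 4,\ |i| \ge 4, \\ 1/8 & \text{if } r = 4,\ |i| = 3, \\ 1/4 & \text{if } r = 4,\ |i| = 2, \\ -1/8 & \text{if } r = 4,\ |i| = 1, \\ -1/2 & \text{if } r = 4,\ |i| = 0, \\ (\delta^2 x^{i+1}_{r-1,t} + \delta^2 x^{i-1}_{r-1,t})/2 & \text{if } r > 2t, \\ (\delta^2 x^{i+2}_{r-2,t-1} + \delta^2 x^{i-2}_{r-2,t-1})/2 & \text{if } t \ge 3. \end{cases} \quad (2) \]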

We prove that if $r$ is divisible by 4 then $\delta^2 x^0_{r,2} < 0$ and for all $t > 2$ we have $\delta^2 x^0_{r,t} \le 0$. These
two facts prove that for $t \ge 2$ we have

\[ \delta x^0_{r,t} = \delta x^0_{r,1} + \sum_{j=2}^{t} \delta^2 x^0_{r,j} < \delta x^0_{r,1}, \]

which is precisely the last statement of the claim.
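
The statement can be checked numerically for small walk lengths. The sketch below recomputes the expected offsets from the recurrence (the function names are hypothetical) and tests whether the first increment dominates all later ones; it is a sanity check on small cases, not a substitute for the proof.

```python
from fractions import Fraction
from functools import lru_cache


@lru_cache(maxsize=None)
def x(i, r, t):
    """Expected |final position| of a walk of length r with t double steps, started at i."""
    if r == 0:
        return Fraction(abs(i))
    if r > 2 * t:
        return (x(i + 1, r - 1, t) + x(i - 1, r - 1, t)) / 2
    return (x(i + 2, r - 2, t - 1) + x(i - 2, r - 2, t - 1)) / 2


def delta_x(r, t):
    """Increment of the expected offset (walk started at 0) when t-1 double steps become t."""
    return x(0, r, t) - x(0, r, t - 1)


def first_increment_is_largest(r):
    """True if delta_x(r, 1) > delta_x(r, t) for every 2 <= t <= r/2."""
    return all(delta_x(r, 1) > delta_x(r, t) for t in range(2, r // 2 + 1))
```

For $r = 4$ and $r = 8$ the check succeeds, whereas for $r = 6$ it fails: there the third increment is 1/2 while the first is only 3/8, which illustrates why the divisibility by 4 is needed.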

For integers $i$, $r \ge 0$, $t$ such that $r \ge 2t$ define $w^i_{r,t}$ to be the number of one dimensional walks
of length $r$ with $t$ double steps and $r - 2t$ unit steps that start in 0 and end in $i$. Observe that
$w^i_{r,t} = w^{-i}_{r,t}$. For even $r \ge 4$, expanding Equation 2 for $\delta^2 x^0_{r,t}$ exhaustively yields the following
expression.

\[ \delta^2 x^0_{r,t} = \frac{1}{2^{r-t-2}} \left( -\frac{1}{2} w^0_{r-4,t-2} + \frac{1}{4} w^{2}_{r-4,t-2} + \frac{1}{4} w^{-2}_{r-4,t-2} \right) = \frac{w^2_{r-4,t-2} - w^0_{r-4,t-2}}{2^{r-t-1}} \]

Hence to prove the statement of the claim it suffices to show that if $r$ is divisible by 4 then
$w^0_{r,0} > w^2_{r,0}$ and $w^0_{r,t} \ge w^2_{r,t}$ for $t \ge 1$. The number of walks satisfies the
following recurrence.

\[ w^i_{r,t} = \begin{cases} 1 & \text{if } r = 0,\ i = 0, \\ 0 & \text{if } r = 0,\ i \ne 0, \\ w^{i-1}_{r-1,t} + w^{i+1}_{r-1,t} & \text{if } r > 2t, \\ w^{i-2}_{r-2,t-1} + w^{i+2}_{r-2,t-1} & \text{if } t \ge 1. \end{cases} \quad (3) \]

It is easy to see that when $r$ is even, $w^{2i}_{r,0} = \binom{r}{r/2+i}$. Since $\binom{r}{x} > \binom{r}{x+1}$ when $x \ge r/2$ it follows
that

\[ w^{2i}_{r,0} > w^{2i+2}_{r,0} \text{ for all } i \ge 0. \quad (4) \]

Equation 4 directly implies $w^0_{r,0} > w^2_{r,0}$.

It remains to prove that if $r$ is divisible by 4 then $w^0_{r,t} \ge w^2_{r,t}$. For the case that $r = 2t$,
expanding Equation 3 exhaustively yields the following expression.

\[ w^i_{2t,t} = \begin{cases} \binom{t}{(2t+i)/4} & \text{if } i \equiv 2t \pmod 4, \\ 0 & \text{otherwise.} \end{cases} \quad (5) \]

Most importantly, if $0 \le i < i'$ and $w^i_{2t,t}$ is non-zero, then $w^i_{2t,t} > w^{i'}_{2t,t}$.

We now prove that when $r - 2t$ is an even, positive integer and $i \ge 0$ then $w^{2i}_{r,t} \ge w^{2i+2}_{r,t}$.
A special case of this inequality is that when $r$ is divisible by 4 then $w^0_{r,t} \ge w^2_{r,t}$. Observe that
when $t = 0$ the inequality follows by Equation 4. We prove the inequality by induction on $r - t$.
Observe that when $r$ decreases by 2 while $t$ decreases by 1, $r - t$ decreases. Hence for $t \ge 1$ and
$i \ge 1$ we have

\[ w^{2i}_{r,t} = w^{2i-2}_{r-2,t-1} + w^{2i+2}_{r-2,t-1} \ge w^{2i}_{r-2,t-1} + w^{2i+4}_{r-2,t-1} = w^{2(i+1)}_{r,t}. \]

Now, for $t \ge 1$ and $i = 0$ we have that $w^0_{r,t} = 2w^0_{r-2,t} + 2w^2_{r-2,t}$ and $w^2_{r,t} = w^0_{r-2,t} + 2w^2_{r-2,t} + w^4_{r-2,t}$.
Hence to prove that $w^0_{r,t} \ge w^2_{r,t}$ it suffices to prove $w^0_{r-2,t} \ge w^4_{r-2,t}$. If $r - 2t = 2$ then by
Equation 5 we have that either $w^0_{r-2,t} = w^4_{r-2,t} = 0$ or $w^0_{r-2,t} > w^4_{r-2,t}$. In both cases this
implies $w^0_{r,t} \ge w^2_{r,t}$. Finally, if $r - 2t > 2$ then the induction hypothesis yields

\[ w^0_{r,t} = 2w^0_{r-2,t} + 2w^2_{r-2,t} \ge w^0_{r-2,t} + 2w^2_{r-2,t} + w^4_{r-2,t} = w^2_{r,t}. \]

Hence, when $r$ is divisible by 4 then $w^0_{r,t} \ge w^2_{r,t}$, concluding the proof of the claim. □
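Set $\Delta = \min_{1 < t \le n^*/2} (\delta x^0_{n^*,1} - \delta x^0_{n^*,t})$. By Claim 1, $\Delta > 0$. Define

\[ E^1_{no} = \binom{k}{2} \cdot \ell_2 \cdot (n^*/2 - x^0_{n^*,1}) + \ell_2 \Delta. \quad (6) \]

Observe that if $\ell_2 > \frac{\ell_1 \cdot k \cdot (n^*/2)}{\Delta}$ then $E^1_{no} > E_{yes}$. Selecting $\ell_2$ slightly larger than this will ensure
the desired gap between $E^1_{no}$ and $E_{yes}$, so we set

\[ \ell_2 = \ell_1 \cdot \left\lceil \frac{k \cdot n^*}{\Delta} \right\rceil. \quad (7) \]

Observe that the ratio between $\ell_2$ and $\ell_1$ is a function of $k$.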



Lemma 3. Let $S^*$ be a subset of $S$ of size $n^*$, where the corresponding edge set $E^*$ has the
property that $|E^* \cap E_{p,q}| \ne 1$ for at least one pair $p, q \le k$. Then $E[d(S^*, c(S^*))] \ge E^1_{no}$.



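Proof. For any position $j$ in the edge block $(p, q)$, $d(S^*[j], c(S^*)[j])$ is distributed as
$n^*/2 - |X^0_{n^*,t[p,q]}|$, where $t[p,q]$ is the number of edges $e$ in $E_{p,q}$ such that for both endpoints of $e$ the
strings corresponding to them are in $S^*$. It follows that $E[d(S^*, c(S^*))] \ge \ell_2 \cdot \sum_{p,q} (n^*/2 - x^0_{n^*,t[p,q]})$, since
here we are just counting the contribution of the edge block positions to the expectation. Since
$|S^*| = n^*$ it follows that $\sum_{p,q} t[p,q] \le \binom{k}{2}$. We can now use Claim 1 to lower bound the
expectation of $d(S^*, c(S^*))$. In particular, we have that

\begin{align*}
E[d(S^*, c(S^*))] &\ge \ell_2 \sum_{p,q} (n^*/2 - x^0_{n^*,t[p,q]}) = \ell_2 \sum_{p,q} \left( n^*/2 - x^0_{n^*,1} + x^0_{n^*,1} - x^0_{n^*,t[p,q]} \right) \\
&= \binom{k}{2} \cdot \ell_2 \cdot (n^*/2 - x^0_{n^*,1}) + \ell_2 \cdot \Big( \binom{k}{2} x^0_{n^*,1} - \sum_{p,q} x^0_{n^*,t[p,q]} \Big) \\
&= \binom{k}{2} \cdot \ell_2 \cdot (n^*/2 - x^0_{n^*,1}) + \ell_2 \cdot \Big( \binom{k}{2} x^0_{n^*,1} - \binom{k}{2} x^0_{n^*,0} - \sum_{p,q} \sum_{t=1}^{t[p,q]} \delta x^0_{n^*,t} \Big) \\
&= \binom{k}{2} \cdot \ell_2 \cdot (n^*/2 - x^0_{n^*,1}) + \ell_2 \cdot \Big( \binom{k}{2} \delta x^0_{n^*,1} - \sum_{p,q} \sum_{t=1}^{t[p,q]} \delta x^0_{n^*,t} \Big).
\end{align*}

Observe that if $t[p,q] = 1$ for all $p, q$ then

\[ \sum_{p,q} \sum_{t=1}^{t[p,q]} \delta x^0_{n^*,t} = \binom{k}{2} \delta x^0_{n^*,1} \]

and the second term of the last equation cancels. However this is not the case, since $S^*$ is
assumed not to correspond to the set of endpoints of a set of edges that intersects with every
$E_{p,q}$. It follows that

\[ \sum_{p,q} \sum_{t=1}^{t[p,q]} \delta x^0_{n^*,t} \le \binom{k}{2} \delta x^0_{n^*,1} - \Delta, \]

which in turn implies that

\[ E[d(S^*, c(S^*))] \ge \binom{k}{2} \cdot \ell_2 \cdot (n^*/2 - x^0_{n^*,1}) + \ell_2 \Delta = E^1_{no}. \]

□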

By Lemma 3 we know that any set $S^*$ such that $E[d(S^*, c(S^*))] < E^1_{no}$ corresponds to all
the endpoints of an edge set $E^*$ such that for every $p, q \le k$ we have $E^* \cap E_{p,q} \ne \emptyset$. It remains to
prove that if $E^*$ does not correspond to the edge set of a $k$-clique in $G$ then $E[d(S^*, c(S^*))] \ge E^2_{no}$
for a value $E^2_{no}$ which is sufficiently large compared to $E_{yes}$. Observe that for each $i \le k$
there are exactly $k - 1$ edges in $E^*$ that are incident to vertices of $V_i$. What we prove is that
if $E[d(S^*, c(S^*))] < E^2_{no}$ then for every $i$, the set of edges in $E^*$ that have an endpoint in $V_i$
all come from the same vertex. Just as for the proof of Lemma 3 we need a preliminary claim
about the properties of certain random walks. Let $\mathcal{V}$ be the set of all vectors with all positive
integer entries such that the sum of the entries is exactly $n^*$ and the sum of all the entries that
are not 1 is at most $k - 1$. Let $\mathcal{V}' = \mathcal{V} \setminus \{u\}$ where $u$ is the vector in $\mathcal{V}$ with one entry equal to $k - 1$.
Observe that for this choice of $u$, $E[|X_u|] = x^{k-1}_{n^*-k+1,0}$. Let $\Delta_2 = \min_{v \in \mathcal{V}'} \left( x^{k-1}_{n^*-k+1,0} - E[|X_v|] \right)$.

Claim 2. For every $v \in \mathcal{V}'$, $x^{k-1}_{n^*-k+1,0} - E[|X_v|] > 0$. Hence $\Delta_2$ is positive. Furthermore, $\Delta_2$
can be computed in time $f(k)$ for some function $f$.

Proof. To see that $\Delta_2$ can be computed in time $f(k)$ for some function $f$ it is sufficient to
observe that the size of the sample space of any variable $X_v$ is bounded by a function of $k$
and that $|\mathcal{V}|$ is upper bounded by a function of $k$ as well. We now prove that $\Delta_2$ is positive.
Consider a vector $v \in \mathcal{V}'$ and let $v'$ be a vector that contains all the entries of $v$ that are greater
than 1, and possibly some entries that are 1, such that the sum of the entries in $v'$ is exactly
$k - 1$. Conditional expectation yields:

\[ E[|X_v|] = \sum_{i=-k+1}^{k-1} P[X_{v'} = i] \cdot x^i_{n^*-k+1,0}. \]

Observe that for every $i$ whose parity is not the same as $k - 1$ we have that $P[X_{v'} = i] = 0$ since
every entry of $v'$ is either added or subtracted to get $X_{v'}$ and $-1 \equiv 1 \pmod 2$. Furthermore,
a simple induction (see below) shows that for every non-negative $i'$ such that $i' \le |i| - 2$,
$x^i_{r,0} > x^{i'}_{r,0}$. Together these two facts imply that for all $i \notin \{k-1, -k+1\}$ for which $P[X_{v'} = i]$
is non-zero we have $x^i_{n^*-k+1,0} < x^{k-1}_{n^*-k+1,0}$. Since such an $i$ must exist for any $v \in \mathcal{V}'$ we have
that $x^{k-1}_{n^*-k+1,0} - E[|X_v|] > 0$.

Finally, we have to prove that for every non-negative $i' \le i - 2$, $x^i_{r,0} > x^{i'}_{r,0}$. We do this by
induction on $r$. For $r = 0$ this clearly holds as $x^i_{0,0} = |i|$. For $r \ge 1$ conditional expectation
yields that $x^i_{r,0} = \frac{1}{2}(x^{i-1}_{r-1,0} + x^{i+1}_{r-1,0}) > \frac{1}{2}(x^{i'-1}_{r-1,0} + x^{i'+1}_{r-1,0}) = x^{i'}_{r,0}$ if $i' \ge 1$, and $x^i_{r,0} =
\frac{1}{2}(x^{i-1}_{r-1,0} + x^{i+1}_{r-1,0}) > \frac{1}{2}(2x^1_{r-1,0}) = x^{i'}_{r,0}$ if $i' = 0$. This concludes the proof. □

We are now ready to prove the last part of the reduction. Let

\[ E^2_{no} = \binom{k}{2} \cdot \ell_2 \cdot (n^*/2 - x^0_{n^*,1}) + k \cdot \ell_1 \cdot (n^*/2 - x^{k-1}_{n^*-k+1,0}) + \ell_1 \Delta_2. \quad (8) \]

Lemma 4. Let $S^*$ be a subset of $S$ of size $n^*$ that corresponds to all edge endpoints of a set $E^*$
of edges such that for every $p, q \le k$ we have $|E^* \cap E_{p,q}| = 1$. If there exist two distinct vertices
$v_1, v_2 \in V_i$ such that $E^*$ contains edges incident to both $v_1$ and $v_2$ then $E[d(S^*, c(S^*))] \ge E^2_{no}$.

Proof. Consider a set $S^*$ satisfying the conditions of the lemma. The contribution to $E[d(S^*, c(S^*))]$
of the edge blocks is exactly $\binom{k}{2} \cdot \ell_2 \cdot (n^*/2 - x^0_{n^*,1})$. Now, consider a vertex block $i$ such that
there is exactly one vertex $v_i \in V_i$ that is incident to edges of $E^*$. This block contributes exactly
$\ell_1 \cdot (n^*/2 - x^{k-1}_{n^*-k+1,0})$ to $E[d(S^*, c(S^*))]$. Finally, consider a block $i$ such that there are two
distinct vertices $v_1, v_2 \in V_i$ such that $E^*$ contains edges incident to both $v_1$ and $v_2$. Let $\vec{v}$ be
a vector that for each vertex $v \in V_i$ which is incident to at least one edge in $E^*$, contains an
entry which is exactly the number of edges in $E^*$ that $v$ is incident to. For each edge which is
not incident to any vertices in $V_i$, the vector $\vec{v}$ contains an entry with value 1. Hence the sum
of the entries of $\vec{v}$ is $n^*$, the sum of all entries in $\vec{v}$ that are greater than 1 is at most $k - 1$,
and $\vec{v}$ contains no entry which is $k - 1$. This is because exactly $k - 1$ edges are incident to
vertices in $V_i$ and two such edges are incident to distinct vertices. Hence $\vec{v} \in \mathcal{V}'$. Now, observe
that the vertex block $i$ contributes exactly $\ell_1 \cdot (n^*/2 - E[|X_{\vec{v}}|])$ to $E[d(S^*, c(S^*))]$. By Claim 2,
$E[|X_{\vec{v}}|] \le x^{k-1}_{n^*-k+1,0} - \Delta_2$. Thus,

\[ E[d(S^*, c(S^*))] \ge \binom{k}{2} \cdot \ell_2 \cdot (n^*/2 - x^0_{n^*,1}) + k \cdot \ell_1 \cdot (n^*/2 - x^{k-1}_{n^*-k+1,0}) + \ell_1 \Delta_2 = E^2_{no}. \]

□

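Set $E_{no} = \min\{E^1_{no}, E^2_{no}\} = E^2_{no}$. From Equations 1, 6, 7 and 8 we conclude that there
exist constants $\kappa_{yes}$, $\kappa_{no}$ and $\kappa_L$ depending only on $k$ such that $E_{yes} = \kappa_{yes} \ell_1$, $E_{no} = \kappa_{no} \ell_1$
and $L = \kappa_L \ell_1$. Furthermore, $\kappa_{yes} < \kappa_{no}$ and the values of $\kappa_{yes}$, $\kappa_{no}$ and $\kappa_L$ can be computed in
time $f(k)$ for some function $f$. Set $\kappa'_{yes} = (2\kappa_{yes} + \kappa_{no})/3$ and $\kappa'_{no} = (\kappa_{yes} + 2\kappa_{no})/3$. Then
$\kappa_{yes} < \kappa'_{yes} < \kappa'_{no} < \kappa_{no}$. We set $D_{yes} = \kappa'_{yes} \ell_1$ and $D_{no} = \kappa'_{no} \ell_1$.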
A randomized analogue of Theorem 4. Before proving Theorem 4 we argue that the
randomized construction works. Specifically, we show that if Min-distance Consensus String
With Outliers has an EPTAS then MCC has a randomized FPT algorithm, implying that W[1]
$\subseteq$ randomized FPT. The results proved in this section are not used in the proof of Theorem 4,
but they provide useful insights on how the deterministic construction works.



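Lemma 5. For any $S^* \subseteq S$ such that $|S^*| = n^*$,

\[ P\left[ \, |d(S^*, c(S^*)) - E[d(S^*, c(S^*))]| > x \cdot \ell_1 \right] \le 2 \exp\left( -\frac{2 x^2 \ell_1}{\kappa_L (n^*)^2} \right). \]

Proof. We have that $d(S^*, c(S^*)) = \sum_{p=1}^{L} d(S^*[p], c(S^*)[p])$. The $d(S^*[p], c(S^*)[p])$'s are
independent random variables taking values from 0 to $n^*$. Since $L = \kappa_L \ell_1$ it follows that
$P[|d(S^*, c(S^*)) - E[d(S^*, c(S^*))]| > x \cdot \ell_1] = P[|d(S^*, c(S^*)) - E[d(S^*, c(S^*))]| > \frac{x}{\kappa_L} \cdot L]$. By
Hoeffding's inequality (Proposition 1) it follows that

\[ P\left[ \, |d(S^*, c(S^*)) - E[d(S^*, c(S^*))]| > \frac{x}{\kappa_L} \cdot L \right] \le 2 \exp\left( -2 \left( \frac{x}{\kappa_L} \right)^2 \frac{L}{(n^*)^2} \right) = 2 \exp\left( -\frac{2 x^2 \ell_1}{\kappa_L (n^*)^2} \right). \]

□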



We now define $\ell_1$. This value for $\ell_1$ is only valid for the randomized construction, and a
different value for $\ell_1$ is used in the proof of Theorem 4.

\[ \ell_1 = \frac{\kappa_L (n^*)^2}{2 (\kappa'_{yes} - \kappa_{yes})^2} \ln\left( 20 (2m)^{n^*} \right). \quad (9) \]

Recall that $m$ is the number of edges in the graph $G$, so $m \le n^2$ and hence $\ell_1 \le f \cdot \log n$ for
some $f$ depending only on $k$.
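
The choice of the block length can be checked numerically. The sketch below assumes the tail bound of Lemma 5 in the form $2\exp(-2x^2\ell_1 / (\kappa_L (n^*)^2))$ and the block length $\ell_1 = \frac{\kappa_L (n^*)^2}{2\,gap^2} \ln(20(2m)^{n^*})$, where `gap` stands for $\kappa'_{yes} - \kappa_{yes}$; both forms, the function names, and the numeric inputs in the usage note are assumptions made for illustration.

```python
import math


def hoeffding_tail(x, l1, kappa_L, n_star):
    """Upper bound on P[|d - E[d]| > x * l1] for a sum of kappa_L * l1 terms in [0, n_star]."""
    return 2.0 * math.exp(-2.0 * x * x * l1 / (kappa_L * n_star ** 2))


def l1_randomized(kappa_L, n_star, gap, m):
    """Block length that drives the tail at deviation gap * l1 down to 1 / (10 * (2m)^n_star)."""
    return kappa_L * n_star ** 2 / (2.0 * gap ** 2) * math.log(20 * (2 * m) ** n_star)
```

With, say, `kappa_L = 5`, `n_star = 6`, `gap = 0.5` and `m = 10`, plugging the resulting block length back into the tail bound gives $1/(10 \cdot 20^6)$, small enough for a union bound over all $(2m)^{n^*}$ candidate subsets to leave failure probability at most 1/10.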

Lemma 6. If $G$ has a $k$-clique $C$, let $S^*$ be the set of strings corresponding to edge endpoints
of edges in $C$. Then $P[d(S^*, c(S^*)) > D_{yes}] \le \frac{1}{10 (2m)^{n^*}}$. If $G$ does not contain a $k$-clique, then
the probability that $S$ contains a subset $S^*$ of size $n^*$ such that $d(S^*, c(S^*)) < D_{no}$ is at most
1/10.

Proof. If $G$ has a $k$-clique $C$, let $S^*$ be the set of strings corresponding to edge endpoints of
edges in $C$. Then by Lemma 2, $E[d(S^*, c(S^*))] = E_{yes}$. Now, $D_{yes} - E_{yes} = (\kappa'_{yes} - \kappa_{yes}) \ell_1$ and
hence, by Lemma 5,

\[ P[d(S^*, c(S^*)) > D_{yes}] \le P\left[ \, |d(S^*, c(S^*)) - E_{yes}| > (\kappa'_{yes} - \kappa_{yes}) \ell_1 \right] \le \frac{1}{10 (2m)^{n^*}}. \]

On the other hand, consider a set $S^*$ of size $n^*$ that does not correspond to the edge endpoints
of a clique. If $S^*$ does not correspond to a set $E^*$ of edges such that $|E^* \cap E_{p,q}| = 1$ for
every $p, q$, then $E[d(S^*, c(S^*))] \ge E^1_{no} \ge E_{no}$. If $S^*$ corresponds to a set $E^*$ of edges such that
$|E^* \cap E_{p,q}| = 1$, but $E^*$ is not the edge set of a clique in $G$, then there exists an $i$ and $v_1$,
$v_2 \in V_i$ such that $E^*$ contains edges incident to $v_1$ and to $v_2$. In this case Lemma 4 yields that
$E[d(S^*, c(S^*))] \ge E^2_{no} = E_{no}$. Hence $E[d(S^*, c(S^*))] \ge E_{no}$. Finally, $E_{no} - D_{no} = (\kappa_{no} - \kappa'_{no}) \ell_1$
and $(\kappa_{no} - \kappa'_{no}) \ell_1 = (\kappa'_{yes} - \kappa_{yes}) \ell_1$ and hence, by Lemma 5,

\[ P[d(S^*, c(S^*)) < D_{no}] \le P\left[ E[d(S^*, c(S^*))] - d(S^*, c(S^*)) > (\kappa'_{yes} - \kappa_{yes}) \ell_1 \right] \le \frac{1}{10 (2m)^{n^*}}. \]

Thus, if $G$ does not contain a clique of size $k$, the union bound over the at most $(2m)^{n^*}$ subsets
of size $n^*$ yields that the probability that $S$ contains a subset $S^*$ of size $n^*$ such that
$d(S^*, c(S^*)) < D_{no}$ is at most 1/10. □
We now prove a randomized analogue of Theorem 4.



Lemma 7. If Min-distance Consensus String With Outliers has an EPTAS then W[1] $\subseteq$ randomized
FPT.

Proof. Assuming that Min-distance Consensus String With Outliers has an EPTAS we give a
randomized fixed parameter tractable algorithm for MCC with two sided error. We construct
the instance to Min-distance Consensus String With Outliers as described and run the EPTAS
with $\epsilon = \frac{D_{no}}{D_{yes}} - 1 = \frac{\kappa'_{no}}{\kappa'_{yes}} - 1$. If the EPTAS returns a set $S^*$ such that $d(S^*, c(S^*)) \le D_{no}$ the
algorithm returns that the input graph $G$ contains a $k$-clique, otherwise we return that $G$ has no
$k$-clique. The construction takes time $O(f(k) n^{O(1)})$ for some function $f$, and $\epsilon$ depends only on
$k$. Hence the EPTAS runs in time $g(k) n^c$ for some function $g$. Thus the algorithm terminates
in FPT time.

If $G$ contains a $k$-clique, then by Lemma 6 with probability at least $1 - \frac{1}{10 (2m)^{n^*}} \ge \frac{9}{10}$
there is a set $S^*$ of size $n^*$ such that $d(S^*, c(S^*)) \le D_{yes}$. If this event occurs, the EPTAS
will find a solution $S'$ such that $d(S', c(S')) \le D_{yes}(1 + \epsilon) \le D_{no}$ and hence the algorithm will
correctly return "yes". Hence the probability of false negatives is at most 1/10.

If $G$ does not contain a $k$-clique, then by Lemma 6 with probability at least 9/10 for every
set $S^*$ of size $n^*$ we have $d(S^*, c(S^*)) > D_{no}$. If this event occurs the algorithm correctly returns
"no" and hence the probability of false positives is at most 1/10. This implies that there is a
randomized fixed parameter tractable algorithm for MCC, which in turn shows that W[1] $\subseteq$
randomized FPT. □



A Deterministic Construction and Proof of Theorem 4. In order to prove Theorem 4
we need to make the construction deterministic. We only used randomness to construct the set
$Z$; all other steps are deterministic. We now show how $Z$ can be computed deterministically
instead of being selected at random, preserving the properties of the reduction. For this, we need
the concept of near $p$-wise independence defined by Naor and Naor [32]. The original definition
of near $p$-wise independence is in terms of sample spaces; we define near $p$-wise independence in
terms of collections of binary strings. This is only a notational difference, and one may freely
translate between the two variants.

Definition 1 ([32]). A set $C = \{c_1, c_2, \dots, c_t\}$ of length $\ell$ binary strings is $(\epsilon, p)$-independent if
for any subset $C'$ of $C$ of size $p$, if a position $i \le \ell$ is selected uniformly at random, then

\[ \sum_{\alpha \in \{0,1\}^p} \left| P[C'[i] = \alpha] - 2^{-p} \right| \le \epsilon. \]
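
For small collections the definition can be checked by brute force. The following sketch (the function name `independence_defect` is hypothetical) computes, over all size-$p$ subsets, the largest total deviation of the empirical pattern distribution from uniform; the collection is $(\epsilon, p)$-independent exactly when the returned value is at most $\epsilon$. It enumerates all subsets and all $2^p$ patterns, so it is meant only as an illustration of the definition.

```python
from fractions import Fraction
from itertools import combinations, product


def independence_defect(strings, p):
    """Largest sum over patterns a of |P[C'[i] = a] - 2^-p| among all size-p subsets C'."""
    length = len(strings[0])
    worst = Fraction(0)
    for subset in combinations(strings, p):
        counts = {}
        for i in range(length):
            pattern = tuple(s[i] for s in subset)
            counts[pattern] = counts.get(pattern, 0) + 1
        defect = sum(
            abs(Fraction(counts.get(a, 0), length) - Fraction(1, 2 ** p))
            for a in product("01", repeat=p)
        )
        worst = max(worst, defect)
    return worst
```

For example, the strings `0011`, `0101` and `0110` are exactly pairwise independent (defect 0 for $p = 2$) but far from 3-wise independent (defect 1 for $p = 3$).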

Naor and Naor [32] and Alon et al. [2] give deterministic constructions of small nearly
$k$-wise independent sample spaces. Reformulated in our terminology, Alon et al. prove a slightly
stronger version of the following theorem.

Theorem 5 (|2j). For every t, p, and e there is a (e,p) -independent set C = {c\, C2, . . • ct} 
of binary strings of length £, where £ = 0( 2 ' ° s * )- Furthermore, C can be computed in time 
0(\C\°^). 

We use Theorem 5 to construct the set $Z$. We set

\[ \epsilon = \frac{\kappa'_{yes} - \kappa_{yes}}{\kappa_L \cdot n^*} \]

and construct an $(\epsilon, n^*)$-independent set $C$ of $2m$ strings. These strings have length $\ell = f \cdot \log(n)$
for some $f$ depending only on $k$, and $C$ can be constructed in time $O(g \cdot n^{O(1)})$ for some $g$
depending only on $k$. We set $\ell_1 = \ell$. Observe that since $\ell_2$ is an integer multiple of $\ell_1$, the length
of the strings in $Z$ is an integer multiple of $\ell_1$. For every $i$ we set $z_i = c_i \circ c_i \circ \dots \circ c_i$, where we used
$\kappa_L$ copies of $c_i$ such that $z_i$ is a string of length $L$. The remaining part of the construction, i.e.
the construction of $S$ from $Z$, remains unchanged. To distinguish between the deterministically
constructed $S$ and the randomized construction, we refer to the deterministically constructed
$S$ as $S_{det}$. We now prove that for every $S^*_{det} \subseteq S_{det}$ of size $n^*$, if $S^*$ is the set of strings
in the randomized construction that corresponds to the same edge endpoints as $S^*_{det}$, then
$d(S^*_{det}, c(S^*_{det}))$ is almost equal to $E[d(S^*, c(S^*))]$.

For a subset $I$ of $\{1, 2, \dots, 2m\}$ define $S^*(I) = \{s_i \in S : i \in I\}$ and $S^*_{det}(I) = \{s_i \in
S_{det} : i \in I\}$. For every $j < \kappa_L$, define $P_j = \{\ell_1 \cdot j + 1, \dots, \ell_1 \cdot (j + 1)\}$. Hence for every
$i$ and $j$, $z_i[P_j] = c_i$. The construction of $S_{det}$ (and $S$) from $Z$ implies that for every $j$
there exists a function $f_j : \mathbb{N} \to \mathbb{N}$ such that for any $i \le 2m$, $s_i[P_j] = z_{f_j(i)}[P_j]$. For
any $I \subseteq \{1, 2, \dots, 2m\}$ and $j < \kappa_L$ we define $Z^*(I, j) = \{z_{f_j(i)} : i \in I\}$. This means that for a
subset $I \subseteq \{1, 2, \dots, 2m\}$ of size $n^*$, the set $Z^*(I, j)$ is the set of $n^*$ strings in $Z$ which $S^*(I)[P_j]$
and $S^*_{det}(I)[P_j]$ depend on. For every set $I \subseteq \{1, 2, \dots, 2m\}$ of size $n^*$ and integer $j < \kappa_L$
define $d^I_j : \{0, 1\}^{n^*} \to \{0, 1, \dots, n^*\}$ to be a function such that for any $p \in P_j$, if $Z^*(I, j)[p] = \alpha$
then $d(S^*(I)[p], c(S^*(I)[p])) = d^I_j(\alpha)$ and $d(S^*_{det}(I)[p], c(S^*_{det}(I)[p])) = d^I_j(\alpha)$. Since $S^*(I)[p]$
depends in exactly the same way on $Z^*(I, j)[p]$ for all $p \in P_j$ the function $d^I_j$ is well defined. For
every set $I \subseteq \{1, 2, \dots, 2m\}$ of size $n^*$ and integer $j < \kappa_L$ we have the following expression for
$d(S^*_{det}(I)[P_j], c(S^*_{det}(I)[P_j]))$.

\[ d(S^*_{det}(I)[P_j], c(S^*_{det}(I)[P_j])) = \ell_1 \cdot \sum_{\alpha \in \{0,1\}^{n^*}} P[Z^*(I, j)[p] = \alpha] \cdot d^I_j(\alpha) \quad (10) \]

Here the probability $P[Z^*(I, j)[p] = \alpha]$ is taken when $p$ is selected from $P_j$ uniformly at random.

For the randomized construction we have that $P[Z^*(I, j)[p] = \alpha] = 2^{-n^*}$, which yields the following
expression.

\[ E[d(S^*(I)[P_j], c(S^*(I)[P_j]))] = \ell_1 \cdot \sum_{\alpha \in \{0,1\}^{n^*}} \frac{1}{2^{n^*}} \cdot d^I_j(\alpha) \quad (11) \]

Combining Equations 10 and 11 yields the following bound.

\begin{align*}
\left| d(S^*_{det}(I)[P_j], c(S^*_{det}(I)[P_j])) - E[d(S^*(I)[P_j], c(S^*(I)[P_j]))] \right|
&= \ell_1 \cdot \Big| \sum_{\alpha \in \{0,1\}^{n^*}} \Big( P[Z^*(I, j)[p] = \alpha] - \frac{1}{2^{n^*}} \Big) d^I_j(\alpha) \Big| \\
&\le \ell_1 \cdot \sum_{\alpha \in \{0,1\}^{n^*}} \Big| P[Z^*(I, j)[p] = \alpha] - \frac{1}{2^{n^*}} \Big| \cdot n^* \\
&\le \ell_1 \cdot \epsilon \cdot n^* \quad (12)
\end{align*}

Summing Equation 12 over $0 \le j < \kappa_L$ yields the desired bound for every $I \subseteq \{1, 2, \dots, 2m\}$ of
size $n^*$.

\[ \left| d(S^*_{det}(I), c(S^*_{det}(I))) - E[d(S^*(I), c(S^*(I)))] \right| \le \ell_1 \cdot \kappa_L \cdot \epsilon \cdot n^* \le \ell_1 \cdot (\kappa'_{yes} - \kappa_{yes}) \quad (13) \]


Equation 13 allows us to finish the proof of Theorem 4. For any set $S^*(I)$ of size $n^*$ that corresponds
to a clique in $G$, we have that $E[d(S^*(I), c(S^*(I)))] = E_{yes} = \ell_1 \kappa_{yes}$, and so by Equation 13
$d(S^*_{det}(I), c(S^*_{det}(I))) \le \ell_1 \kappa'_{yes} = D_{yes}$. For any set $S^*(I)$ of size $n^*$ that does not correspond
to a clique in $G$, we have that $E[d(S^*(I), c(S^*(I)))] \ge E_{no} = \ell_1 \kappa_{no}$, and so by Equation 13
$d(S^*_{det}(I), c(S^*_{det}(I))) \ge \ell_1 \kappa'_{no} = D_{no}$. Since $\frac{D_{no}}{D_{yes}} \ge 1 + \delta$ for some $\delta > 0$ depending only on $k$, an
EPTAS for Min-distance Consensus String With Outliers can be used to distinguish between
images of "yes" instances of MCC and images of "no" instances of MCC in time $f(k) n^{O(1)}$ for
some function $f$. Hence Min-distance Consensus String With Outliers does not have an EPTAS
unless FPT = W[1], concluding the proof of Theorem 4. □

4 Parameterized Intractability Results 

From Theorem 4 we can extract intractability results for various parameterizations of Consensus
String with Outliers. In the proof of Theorem 4 we reduced instances of MCC to an instance of
Consensus String with Outliers where the size $n^*$ of the solution sought is $k \cdot (k - 1)$. Here $k$
is the size of the clique sought in the MCC instance. Thus an FPT algorithm for Consensus
String with Outliers parameterized by $n^*$ would give an FPT algorithm for MCC. This proves
the following theorem.

Theorem 6. Consensus String with Outliers is W[1]-hard when parameterized by $n^*$, even when
$\Sigma = \{0, 1\}$.

Since an EPTAS for a problem implies an FPT algorithm for the problem parameterized by
the value of the objective function [31], Theorem 6 immediately implies that Consensus String with
Max Non-Outliers does not admit an EPTAS unless FPT=W[1]. This means that, in some sense,
the PTAS provided in Theorem 3 is the best we can hope for.

In the context of error correction for DNA fragment assembly, we expect the number $k$
of outliers to be reasonably small. A simple brute force algorithm for Consensus String with
Outliers that tries all $\binom{n}{k}$ subsets of $S$ of size $n^*$ works in $O(n^{k+O(1)} \ell)$ time. It is natural to ask
whether we can significantly improve over this algorithm, in particular whether Consensus String
with Outliers is FPT when parameterized by $k$. Using Theorem 6 as a starting point, we show
that Consensus String with Outliers parameterized by $k$ is W[1]-hard, even when the alphabet
is binary.
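
The brute force algorithm mentioned above can be sketched as follows. This is only an illustrative sketch: the function names are ours, strings are plain Python strings, and the consensus of a subset is taken to be its column-wise majority string (which minimizes the total Hamming distance to that subset).

```python
from itertools import combinations

def hamming(x, y):
    """Number of positions at which two equal-length strings differ."""
    return sum(a != b for a, b in zip(x, y))

def consensus(strings):
    """Column-wise majority string; ties are broken toward the smaller symbol."""
    return "".join(
        min(set(col), key=lambda ch: (-col.count(ch), ch))
        for col in zip(*strings)
    )

def total_distance(strings, center):
    """Sum of Hamming distances from every string to the center."""
    return sum(hamming(s, center) for s in strings)

def brute_force(strings, k, d):
    """Try every way of discarding k outliers; return (kept, center) or None."""
    n_star = len(strings) - k
    for kept in combinations(strings, n_star):
        center = consensus(kept)
        if total_distance(kept, center) <= d:
            return list(kept), center
    return None
```

For instance, on the four strings 0000, 0001, 0010, 1111 with k = 1 and d = 2, the sketch discards 1111 and returns the consensus 0000; with d = 1 no subset qualifies.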

It will be convenient to consider a set $S = \{s_1, \ldots, s_n\}$ of $n$ length-$\ell$ strings as an
$n \times \ell$ matrix; the $i$th column of $S$ is then the vector $[s_1[i], \ldots, s_n[i]]^T$. An instance of Consensus
String with Outliers is given by a set $S$ of $n$ length-$\ell$ strings with input parameters $k$ and $d$. We
assume $\ell \ge 2$ and $d \ge 2$ since $\ell \le 1$ and $d \le 1$ produce trivial cases. We describe how to generate
a Consensus String with Outliers instance with a set $S'$ of $n'$ strings of length $\ell'$ and parameters
$d'$ and $k'$, where there exists a subset $S'_O$ of $n^*$ outlier strings such that $d(S'/S'_O, c(S'/S'_O)) \le d'$
if and only if there exists a subset $S^*$ of $n^*$ non-outlier strings in the original instance such that
$d(S^*, c(S^*)) \le d$. Let $n' = 3n$, $k' = n^*$, $d' = -n^*\ell + d + 10\ell(n - n^*)$ and $\ell' = 11\ell$. The $3n$
strings are generated as follows:

1. For each string $s_i \in S$ there exists an $11\ell$-length string $s'_i \in S'$, where the first $\ell$ symbols
of $s'_i$ are equal to $s_i$ and the remaining $10\ell$ symbols of $s'_i$ are equal to 1. We denote this
subset of strings of $S'$ as $S'_{org}$.

2. The remaining $2n$ length-$11\ell$ strings are constructed so that each of the first $\ell$ columns of
$S'$ contains an equal number of positions equal to 0 and positions equal to 1. The last $10\ell$
positions are equal to 0.
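
A minimal sketch of this padding construction is given below. The text does not fix how the $2n$ filler strings achieve the balance in the first $\ell$ columns; the layout used here (n fillers starting with $0^\ell$ and n starting with $1^\ell$) is our own assumption, as are the function and key names. The threshold $d'$ is deliberately not computed.

```python
def build_padded_instance(strings, n_star):
    """Pad a binary instance (n strings of length l) to 3n strings of length 11*l."""
    n, l = len(strings), len(strings[0])
    # Item 1: copy each original string and append 10*l ones (the subset S'_org).
    s_org = [s + "1" * (10 * l) for s in strings]
    # Item 2 (assumed layout): n fillers begin with 0^l and n begin with 1^l, so the
    # fillers are balanced in each of the first l columns; all end in 10*l zeros.
    tail = "0" * (10 * l)
    fillers = ["0" * l + tail] * n + ["1" * l + tail] * n
    return {
        "strings": s_org + fillers,
        "s_org": s_org,
        "n_prime": 3 * n,
        "k_prime": n_star,
        "l_prime": 11 * l,
    }
```

With this layout, every one of the last $10\ell$ columns contains $2n$ zeros and $n$ ones, which is the property the proofs below rely on.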

Lemma 8. For a Consensus String with Outliers instance containing a subset $S^*$ of $n^*$ non-outlier
strings that satisfy $d(S^*, c(S^*)) \le d$, the previous construction produces an instance with
a set $S'_O$ of $k'$ outlier strings that satisfy $d(S'/S'_O, c(S'/S'_O)) \le d'$.

Proof. Without loss of generality, assume $c(S^*)$ is equal to $0^\ell$. Let $S'_O$ be the $n^*$ strings of $S'_{org}$
corresponding to $S^*$. Then we claim $d(S'/S'_O, 0^{11\ell}) \le d'$. Since there exist $2n$ strings equal
to 0, and $n$ strings equal to 1 at the last $10\ell$ positions, it follows that $c(S'/S'_O)[i] = 0$ for all
$\ell < i \le 11\ell$. Since $S'_O$ corresponds to $S^*$, it follows that the contribution of these
$10\ell$ positions to $d(S'/S'_O, 0^{11\ell})$ is $10\ell(n - n^*)$. Now consider the first $\ell$ positions, which, we remind
the reader, contain an equal number of 0's and 1's. Since we eliminate $n^*$ strings from
$S'$, the contribution of these positions to $d(S'/S'_O, 0^{11\ell})$ is at most $d' - 10\ell(n - n^*)$, concluding the proof. □

For the reverse direction, we need to prove that the existence of a subset $S'_O$ of $n^*$ outlier strings
in $S'$ that satisfy the constraint $d(S'/S'_O, c(S'/S'_O)) \le d'$ implies the existence of a subset $S^*$ of $n^*$
strings in $S$ that satisfy the constraint $d(S^*, c(S^*)) \le d$.

Lemma 9. The $k'$ outlier strings in $S'$ correspond to a subset $S^*$ of $n^*$ non-outlier strings in
$S$ where $d(S^*, c(S^*)) \le d$.

Proof. Let $S'_O$ be a set of $n^*$ outlier strings in $S'$ that correspond to the minimum distance,
i.e. $d(S'/S'_O, c(S'/S'_O)) \le d(S'/S''_O, c(S'/S''_O))$ for any subset $S''_O$ that is not equal to $S'_O$. Since
there are $2n$ strings of $S'$ that are equal to 0 in the last $10\ell$ positions, and $n$ strings of $S'$
that are nonzero at these positions, it follows that $c(S'/S'_O)[i] = 0$ for all $\ell < i \le 11\ell$. We
argue that $S'_O$ is contained in $S'_{org}$. From the pigeonhole principle it follows that there exists at
least one string, say $s'_1$, that is not in $S'_O$ but is contained in $S'_{org}$. For contradiction, we assume
there exists a string, say $s'_2$, that is contained in $S'_O$ but not contained in $S'_{org}$. Note that
$d(s'_2, c(S'/S'_O)) < d(s'_1, c(S'/S'_O))$ because at the last $10\ell$ positions we have: $c(S'/S'_O)$ equal to
0, $s'_1$ equal to 1, and $s'_2$ equal to 0. Let $\hat{S} = (\{S'/S'_O\} / s'_1) \cup s'_2$. By definition of $c(\hat{S})$, we have
$d(\hat{S}, c(\hat{S})) \le d(\hat{S}, c(S'/S'_O))$, which can be bounded as follows:

$$\begin{aligned}
d(\hat{S}, c(\hat{S})) &\le d(\hat{S}, c(S'/S'_O)) \\
&= d(S'/S'_O, c(S'/S'_O)) - d(s'_1, c(S'/S'_O)) + d(s'_2, c(S'/S'_O)) \\
&< d(S'/S'_O, c(S'/S'_O)).
\end{aligned}$$

This contradicts the fact that $S'/S'_O$ is a minimal solution; hence all outlier strings in
$S'_O$ are contained in $S'_{org}$. The last $10\ell$ positions will have at least $10\ell(n - n^*)$ mismatches with
$c(S'/S'_O)$ and it follows that the bound $\sum_{s'_i \in S'/S'_O} d(0^{11\ell}, s'_i) \le d'$ is achieved when $d(S'_O, c(S'_O)) \le d$. □

Our main theorem follows directly from Lemma 8 and Lemma 9.

Theorem 7. Consensus String with Outliers is W[1]-hard when parameterized by $k$, even when
$\Sigma = \{0, 1\}$.

5 Parameterized Tractability Results 

In this section, we prove Consensus String with Outliers is fixed-parameter tractable with respect
to the parameter $\delta = d/n^*$ when the alphabet size is bounded by a constant. For the remainder
of this section, we assume that $\delta > 0$; otherwise Consensus String with Outliers
can trivially be solved in polynomial time. The algorithm and analysis are nearly identical
to those demonstrating that Consensus Patterns is fixed-parameter tractable with respect to the
parameterization $\delta = d/n$ and bounded alphabet size [29], where $\delta$ is the average error between
the consensus string and the length-$\ell$ substrings $s'_i$.

First, we define some terms and notation that will be used in this section. A hypergraph
$G = (V_G, E_G)$ consists of a set of vertices $V_G$ and a collection of edges $E_G$, where each edge is
a subset of $V_G$. Given two hypergraphs, $H = (V_H, E_H)$ and $G = (V_G, E_G)$, we say $H$ appears
at $V' \subseteq V_G$ as a partial hypergraph if there is a bijection $\pi$ between the elements of $V_H$ and $V'$
such that for every edge $E \in E_H$ there exists an edge $\pi(E) \in E_G$ (where the mapping $\pi$ is
extended to the edges the obvious way). For example, if $H$ has the edges $\{1,2\}, \{2,3\}$, and $G$
has the edges $\{a,b\}, \{b,c\}, \{c,d\}$, then $H$ appears as a partial hypergraph at $\{a,b,c\}$ and at
$\{b,c,d\}$. Given two hypergraphs, $H = (V_H, E_H)$ and $G = (V_G, E_G)$, we say that $H$ appears at
$V' \subseteq V_G$ as subhypergraph if there is such a bijection $\pi$ where for every edge $e \in E_H$ there is
an edge $e' \in E_G$ with $\pi(e) = e' \cap V'$. For example, let the edges of $H$ be $\{1,2\}, \{2,3\}$, and let
the edges of $G$ be $\{a,c,d\}, \{b,c,d\}$. Then $H$ does not appear in $G$ as a partial hypergraph, but it appears as a subhypergraph at $\{a,b,c\}$ and at $\{a,b,d\}$.
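
The two notions of appearance can be checked by exhaustive search on small inputs. The following sketch is ours (it is not the Find-Subhypergraph algorithm discussed later and runs in exponential time); the function name and the `mode` argument are illustrative choices.

```python
from itertools import combinations, permutations

def places(h_vertices, h_edges, g_vertices, g_edges, mode):
    """Return the vertex subsets V' of G at which H appears.

    mode="partial": pi(E) must itself be an edge of G.
    mode="sub":     pi(E) must equal the intersection of some edge of G with V'.
    """
    h_vertices = list(h_vertices)
    g_edge_sets = [frozenset(e) for e in g_edges]
    found = []
    for subset in combinations(sorted(g_vertices), len(h_vertices)):
        v_prime = frozenset(subset)
        if mode == "partial":
            targets = set(g_edge_sets)
        else:
            targets = {e & v_prime for e in g_edge_sets}
        # Try every bijection pi from the vertices of H onto V'.
        for image in permutations(subset):
            pi = dict(zip(h_vertices, image))
            if all(frozenset(pi[v] for v in e) in targets for e in h_edges):
                found.append(set(subset))
                break
    return found
```

Applied to the two examples of the preceding paragraph, the sketch reports the places {a,b,c} and {b,c,d} in the first case (partial hypergraph) and {a,b,c} and {a,b,d} in the second (subhypergraph).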

An edge cover of $H$ is a subset $E' \subseteq E_H$ such that each vertex of $V_H$ is contained in at least
one edge of $E'$. The edge cover number $\rho(H)$ is the size of the smallest edge cover in $H$. A
fractional edge cover is an assignment $\psi : E_H \to [0,1]$ such that $\sum_{E \ni v} \psi(E) \ge 1$ for every
vertex $v$. The fractional cover number, denoted as $\rho^*(H)$, is the minimum of $\sum_{E \in E_H} \psi(E)$
taken over all fractional edge covers $\psi$.
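
To illustrate the gap between the two cover numbers, the sketch below computes $\rho(H)$ by brute force and verifies a given fractional edge cover. It does not compute $\rho^*(H)$ itself (that would require solving a linear program); the function names are ours.

```python
from fractions import Fraction
from itertools import combinations

def edge_cover_number(vertices, edges):
    """Size of a smallest set of edges whose union contains every vertex."""
    vertices = set(vertices)
    for size in range(1, len(edges) + 1):
        for chosen in combinations(edges, size):
            if set().union(*chosen) >= vertices:
                return size
    return None  # some vertex lies in no edge

def is_fractional_cover(vertices, edges, weights):
    """Check that the weights of the edges containing each vertex sum to at least 1."""
    return all(
        sum(w for e, w in zip(edges, weights) if v in e) >= 1
        for v in vertices
    )
```

For the triangle with edges {1,2}, {2,3}, {1,3}, the edge cover number is 2, while assigning weight 1/2 to every edge is a valid fractional edge cover of total weight 3/2, so $\rho^*(H) \le 3/2 < \rho(H)$.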

Marx [29] demonstrated that Consensus Patterns can be solved in $f(\delta) \cdot n^9$ time by constructing a
hypergraph $G$ from the Consensus Patterns instance, defining a combinatorial characterization
of a solution to the instance with respect to the hypergraph representation, and enumerating
(efficiently) over all subhypergraphs in $G$ with the defined combinatorial characterization. It
is shown that only hypergraphs having at most $\delta$ vertices and at most $200 \log \delta$ edges need to be
considered (Proposition 6.3 in [29]), and that any edge of size greater than $20\delta$ can be removed
from $G$ while all subhypergraphs corresponding to a solution to the original Consensus Patterns
instance are retained, if they exist. The enumeration step is completed by considering all
possible hypergraphs with at most $\delta$ vertices and at most $200 \log \delta$ edges, and for each such
hypergraph, $H_0$, determining every place where $H_0$ appears in $G$ as a subhypergraph. This
paradigm for solving the Consensus Patterns problem makes use of an efficient algorithm for
finding all the places $V' \subseteq V_G$ in $G$ where $H$ appears as a subhypergraph for two given hypergraphs
$H = (V_H, E_H)$ and $G = (V_G, E_G)$. The result of Marx [29], which proves a tight upper bound
on the time required to perform this enumeration step, is essential.

The following result by Friedgut and Kahn [17] gives a bound on the maximum number of
times a hypergraph $H = (V_H, E_H)$ can appear as partial hypergraph in a hypergraph $G$ with
$m$ edges, i.e. the maximum number of different subsets $V' \subseteq V_G$ where $H$ can appear in $G$.

Theorem 8. Let $H$ be a hypergraph with fractional cover number $\rho^*(H)$, and let $G$ be a
hypergraph with $m$ edges. There are at most $|V_H|^{|V_H|} \cdot m^{\rho^*(H)}$ different subsets $V' \subseteq V_G$ such
that $H$ appears in $G$ at $V'$ as partial hypergraph. Furthermore, for every $H$ and sufficiently
large $m$, there is a hypergraph with $m$ edges where $H$ appears $m^{\rho^*(H)}$ times.

Marx [29] extended this theorem by giving a bound on the running time required to enumerate
through all possible partial hypergraphs of a given hypergraph $G$. In particular, if $H$ is
a hypergraph with fractional cover number $\rho^*(H)$, and $G$ is a hypergraph with $m$ edges and the
size of each edge is at most $\ell$, then hypergraph $H$ can appear in $G$ as subhypergraph at most
$|V_H|^{|V_H|} \cdot \ell^{|V_H| \rho^*(H)} \cdot m^{\rho^*(H)}$ times. Given hypergraphs $H = (V_H, E_H)$ and $G = (V_G, E_G)$,
if there are $t$ places in $G$ where $H$ appears as subhypergraph then obviously we cannot enumerate
all of them in fewer than $t$ steps; however, there exists an algorithm that performs this
enumeration in time that is polynomial in the upper bound $|V_H|^{|V_H|} \cdot \ell^{|V_H| \rho^*(H)} \cdot m^{\rho^*(H)}$.
We refer to this algorithm as Find-Subhypergraph.

Theorem 9. [29] Let $H = (V_H, E_H)$ be a hypergraph with fractional cover number $\rho^*(H)$, and
let $G = (V_G, E_G)$ be a hypergraph where each edge has size at most $\ell$. There is an algorithm that
enumerates in time $|V_H|^{O(|V_H|)} \cdot \ell^{|V_H| \rho^*(H) + 1} \cdot |E_G|^{\rho^*(H) + 1} \cdot |V_G|^2$ every subset $V' \subseteq V_G$ where $H$
appears in $G$ as a subhypergraph.

Given a Consensus String with Outliers instance with a set $S$ of $n$ length-$\ell$ strings and
integer $n^*$, we define a minimal solution for this instance as a set $S_m \subseteq S$ of size $n^*$ and a length-$\ell$ string $s_m$,
where $\sum_{s_i \in S_m} d(s_m, s_i)$ is minimal.

Theorem 10. Consensus String with Outliers can be solved in time $\delta^{O(\delta)} \cdot |\Sigma|^{\delta} \cdot n^9$.

Proof. Let $\{S, k, d\}$ be an instance of Consensus String With Outliers with solution $S^*$, and let $s$
denote the consensus string corresponding to $S^*$. Clearly, $d(s, s^*) \le \delta$ for at least one $s^* \in S^*$
and thus, if there exists a consensus string for $S^*$ then it can be found by considering
all $s_0 \in S$ and checking if any string that has distance at most $\delta$ from $s_0$ is a consensus string
for some subset of strings of $S$ of size $n^*$. Next, we show how to perform this analysis for one
particular string $s_0 \in S$. Since there are at most $n$ possibilities for choosing $s_0$,
the running time of our algorithm for Consensus String with Outliers will be the running time
of the following algorithm multiplied by a factor of $n$.

Given $s_0 \in S$, we construct a hypergraph $G = (V, E)$, where $V = \{v_1, v_2, \ldots, v_\ell\}$ and the
edge set describes the possible strings in the set of non-outlier strings of $S$. For each $s_i \in S$,
there exists an edge $E_i \in E$, where $v_k \in E_i$ if and only if the symbol at position $k$ of $s_0$ is not equal to the
symbol at position $k$ of $s_i$. Clearly, $G$ has at most $n$ edges. Suppose $S^*$ is a solution to the
original instance; then we denote by $H = (V_H, E_H)$ the partial hypergraph in $G$ that contains
the $n^*$ edges corresponding to the strings in $S^*$.
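
The construction of $G$ from $s_0$ can be sketched in a few lines. The function name and the use of 1-based position indices as vertices are our own choices for illustration.

```python
def difference_hypergraph(s0, strings):
    """One edge per string: the set of positions (1-based) where it differs from s0."""
    vertices = set(range(1, len(s0) + 1))
    edges = [
        frozenset(p for p in vertices if s[p - 1] != s0[p - 1])
        for s in strings
    ]
    return vertices, edges
```

Note that the size of the edge built from a string equals its Hamming distance to $s_0$, so the edge of $s_0$ itself is empty.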

Let $S_m$ and $s_m$ be a minimal solution to our original instance. Denote by $P$ the set of
positions where $s_m$ and $s_0$ differ and let $H_0$ be the subhypergraph of $H$ induced by $P$, i.e. the
vertex set of $H_0$ is equal to the vertices corresponding to the positions in $P$, and for each edge
$E \in E_H$ there is an edge $E \cap P$ in $H_0$. Since $H_0$ is a subhypergraph of $H$ and $H$ is a partial
hypergraph of $G$, it follows that $H_0$ appears in $G$ at $P$ as a subhypergraph. The following
proposition shows the fractional cover number of $H_0$ is at most 5/2, since the definition of a
minimal solution to Consensus Patterns is identical to our definition of a minimal solution to
Consensus String with Outliers.

Proposition 2. [29] Let $\{S_m, s_m\}$ be a minimal solution to a Consensus Patterns instance;
then the hypergraph $H_0$ corresponding to $\{S_m, s_m\}$ has fractional cover number at most 5/2.

We can find all possible places $P$ by enumerating every suitable hypergraph $H_0$ and using
Theorem 9 to find all places where $H_0$ appears in $G$ as a subhypergraph. In order to adequately
bound the running time incurred by using the algorithm corresponding to Theorem 9, a
bound on the size of the edges in $G$ is required. It follows from the work of Marx [29] that we
can remove every edge of size greater than $20\delta$ from $G$ (and $H$ respectively). Let $G^*$ (and $H^*$
respectively) be the resulting hypergraph and $H_0$ be the subhypergraph of $H^*$ induced by $P$.
Since $H_0$ is a subhypergraph of $G^*$ and the fractional edge cover number can be bounded by a
constant (Proposition 2), we can find all the possible places $P$ by enumerating every hypergraph
$H_0$ on $\delta$ vertices having fractional cover number at most 5/2 and finding every place in $G^*$ where
$H_0$ appears. The following proposition demonstrates that we only need to consider hypergraphs
that have $O(\log \delta)$ edges, further restricting the hypergraphs that need consideration.

Proposition 3. [29] Let $\{S_m, s_m\}$ be a minimal solution to a Consensus String with Outliers
instance, and let $H_0$ be the corresponding hypergraph; then it is possible to select $200 \log \delta$ edges
of $H_0$ in such a way that if we delete all other edges, then the resulting hypergraph $H_0^*$ has
fractional cover number at most 5.

There are $n$ possible choices for $s_0$ in the first step and the remainder of the algorithm checks
whether there is a consensus string that differs from $s_0$ in at most $\delta$ positions. Constructing the
hypergraph $G^*$ can be done in $O(\ell n)$ time. Since the aim is to find strings $s$ where $d(s_0, s) \le \delta$, we
can assume that $H_0^*$ has at most $\delta$ vertices; there are at most $2^{200 \delta \log \delta} = 2^{O(\delta \log \delta)}$ unique hypergraphs

Algorithm 1 Consensus String with Outliers d-Parameterization Algorithm
1: For each string $s_0 \in S$:
2:     Construct the hypergraph $G^*$ on $\{1, 2, \ldots, \ell\}$.
3:     For each hypergraph $H_0^*$ having $\le \delta$ vertices and $\le 200 \log \delta$ edges:
4:         If every vertex of $H_0^*$ is covered by at least 1/5 part of the edges then:
5:             For every place $P$ where $H_0^*$ appears in $G^*$ as a subhypergraph:
6:                 For every string $s$ that differs from $s_0$ at the positions corresponding to $P$:
7:                     Let $S^* \subseteq S$ be of size $n^*$, where $d(s, s'_i) \le d(s, s_j)$ for all $s'_i \in S^*$ and all $s_j \in S/S^*$.
8:                     If $d(s, s'_i) \le \delta$ for all $s'_i \in S^*$ then:
9:                         Return $s_0$ and $S^*$.
10: Return "no solution" and halt.



Parameter(s)          $\Sigma$ is bounded    $\Sigma$ is unbounded
$\ell$, $d$, $n^*$    FPT                    W[1]-hard
$\ell$                FPT                    W[1]-hard
$n^*$                 W[1]-hard              W[1]-hard
$k$                   W[1]-hard              W[1]-hard
$d$                   FPT                    W[1]-hard

Table 1: An overview of the fixed-parameter tractability and intractability of
Consensus String with Outliers.

with at most $\delta$ vertices and at most $200 \log \delta$ edges since there are at most $2^{\delta}$ possibilities for
each edge. Therefore, Step 3 enumerates through at most $O(2^{O(\delta \log \delta)})$ hypergraphs. The test in
Step 4 is trivial. Step 5 is performed using the Find-Subhypergraph algorithm corresponding to Theorem
9. It follows from the fact that the fractional cover number of $H_0^*$ is at most 5 and every edge
of $G^*$ has size at most $20\delta$, that Step 5 takes $\delta^{O(\delta)} n^6 \ell^2$ time. If $H_0^*$ appears at $P$ in $G^*$ as
subhypergraph, then Step 6 considers at most $|\Sigma|^{\delta}$ possible strings and testing each string takes
$O(\ell n)$ time. Therefore, the total running time is $\delta^{O(\delta)} \cdot |\Sigma|^{\delta} \cdot n^9$.

□ 
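
Steps 7 and 8 of Algorithm 1 are the only steps that touch the strings directly, and they can be sketched as below. The hypergraph enumeration of Steps 3 to 5 is not reproduced here; the function names are ours, and Step 8 is implemented exactly as printed in the algorithm (a per-string distance test against $\delta$).

```python
def hamming(x, y):
    """Number of positions at which two equal-length strings differ."""
    return sum(a != b for a, b in zip(x, y))

def closest_subset(candidate, strings, n_star):
    """Step 7: the n_star strings nearest to the candidate (ties keep input order)."""
    ranked = sorted(strings, key=lambda s: hamming(candidate, s))
    return ranked[:n_star]

def passes_step_8(candidate, chosen, delta):
    """Step 8 as printed: every chosen string is within distance delta of the candidate."""
    return all(hamming(candidate, s) <= delta for s in chosen)
```

Sorting by distance makes Step 7 run in $O(\ell n + n \log n)$ time per candidate in this sketch; a linear-time selection would match the $O(\ell n)$ bound used in the analysis.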

6 Conclusions and Future Work 

We presented the Consensus String with Outliers problem with the aim of modeling error correction
of genomic data, and demonstrated that studying its parameterized complexity and
approximability leads to surprising theoretical results. We studied the complexity of Consensus
String with Outliers with respect to different parameterizations; Table 1 summarizes these results.
The majority of these results are proved using standard parameterized reductions and hence,
we leave them to the Appendix. The most notable of these results demonstrates that Consensus
String with Outliers parameterized by $\delta$ is FPT.

Our results rule out the possibility of a $(1+\epsilon)$ approximation algorithm that has running time
$O(f(1/\epsilon) n^{O(1)})$, while our PTAS has running time $O(n^{1/\epsilon^4})$. Hence there is still a significant
gap between known upper and lower bounds for the running time of approximation schemes for
the problem. Obtaining tighter bounds warrants further investigation.

Another problem that is FPT parameterized by objective function value and admits a PTAS,
but is not known to admit an EPTAS, is the Consensus Patterns problem [29], which seems to be
closely related to Consensus String with Outliers. It is quite possible that our results on random
walks and our hardness proofs could be useful to rule out an EPTAS for Consensus Patterns, which
would answer an open problem given by Fellows et al. [14], and for other problems as well.

7 Appendix



We prove that when the alphabet size is unbounded Consensus String With Outliers is W[1]-hard
for every combination of the parameters $\ell$, $d$, and $n^*$. We define an instance of Clique
by an undirected graph $G = (V, E)$ with a set $V = \{v_1, v_2, \ldots, v_n\}$ of $n$ vertices, a set $E$ of $m$
edges, and a positive integer $t$ denoting the size of the desired clique. We generate a set $S$ of
$\binom{t}{2}|E|$ strings such that $G$ has a clique of size $t$ if and only if there is a subset of $S$ of size $\binom{t}{2}$,
denoted as $S^*$, where there exists a string $x$ such that $\sum_{s_i \in S^*} d(s_i, x) \le d$. We let $\ell = t$ and
$d = \binom{t}{2}(t - 2)$. We assume that $t \ge 2$ since $t \le 1$ produces trivial cases.

Theorem 11. Consensus String with Outliers with an unbounded alphabet is W[1]-hard with
respect to the parameters $\ell$, $d$, and $n^*$.

Proof. We begin by describing the alphabet. We assume $|\Sigma|$ can be unbounded and we let $\Sigma$ be
equal to the union of the following sets of symbols:

1. $\{v_i \mid i = 1, \ldots, |V|\}$. Hence, there exists one symbol representing each vertex in $G$.

2. $\{c_{i,j,m} \mid i = 1, \ldots, t;\ j = 1, \ldots, t;\ m = 1, \ldots, |E|\}$. There exists a unique symbol for each of the
$\binom{t}{2} \cdot |E|$ strings produced by our reduction.

Hence, we have a total of $|V| + \binom{t}{2} \cdot |E|$ symbols.

We construct a set of $\binom{t}{2}|E|$ strings $S = \{s_{1,2,1}, \ldots, s_{1,2,|E|}, \ldots, s_{t-1,t,|E|}\}$.
Every string has length $t$ and will encode one edge of the input graph. There will be $\binom{t}{2}$
strings corresponding to each edge; however, they encode the edge in different positions. For string $s_{i,j,m}$
we encode edge $e_m = (v_r, v_s)$, where $1 \le r < s \le |V|$, by letting position $i$ equal $v_r$,
position $j$ equal $v_s$, and the remaining positions equal $c_{i,j,m}$. Hence, a string is given by

$$s_{i,j,m} := [c_{i,j,m}]^{i-1} \, v_r \, [c_{i,j,m}]^{j-i-1} \, v_s \, [c_{i,j,m}]^{t-j}.$$
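
The string construction can be sketched as follows. Because the alphabet symbols are not single characters, each string is represented here as a tuple of symbols; this encoding, the tuple form of the symbols, and the function names are our own illustrative choices.

```python
from itertools import combinations

def clique_strings(edges, t):
    """Map (i, j, m) to the length-t string s_{i,j,m}, stored as a tuple of symbols."""
    out = {}
    for i, j in combinations(range(1, t + 1), 2):      # positions i < j
        for m, (r, s) in enumerate(edges, start=1):    # edge e_m = (v_r, v_s), r < s
            filler = ("c", i, j, m)                    # the symbol c_{i,j,m}
            string = [filler] * t
            string[i - 1] = ("v", r)
            string[j - 1] = ("v", s)
            out[(i, j, m)] = tuple(string)
    return out

def hamming(x, y):
    """Number of positions at which two equal-length sequences differ."""
    return sum(a != b for a, b in zip(x, y))
```

For the graph with edges (v1,v2), (v1,v3), (v1,v4), (v2,v3) and t = 3, the sketch produces 12 strings, and exactly three of them (one per edge of the triangle v1, v2, v3) lie at distance t - 2 = 1 from the string v1 v2 v3.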

To clarify our reduction, we give an example. Let $G = (V, E)$ be an undirected graph with
$V = \{v_1, v_2, v_3, v_4\}$ and edges $E = \{(v_1, v_2), (v_1, v_3), (v_1, v_4), (v_2, v_3)\}$ and let our Clique instance
have $G$ and $t = 3$. Using $G$, we exhibit the above construction of $\binom{3}{2} \cdot |E| = 12$ strings, which
we denote as $S$. We claim that there exists a clique of size 3 if and only if there exists a string
$s^*$ of length $\ell = t = 3$ and a subset $S^*$ of $S$ of size 3 where $d(S^*, c(S^*)) \le d$.

First, we show that for a graph with a clique of size $t$, the above construction produces an
instance of Consensus String with Outliers with a set $S^*$ and consensus string $c(S^*)$ of length $\ell$ such
that $d(S^*, c(S^*)) \le d$. Let the input graph have a clique of size $t$. Let $v_{\alpha_1}, v_{\alpha_2}, \ldots, v_{\alpha_t}$ be the
vertices in the clique $C$ of size $t$ and without loss of generality, assume $\alpha_1 < \alpha_2 < \cdots < \alpha_t$.
Then we claim that there exists a subset of $\binom{t}{2}$ strings that have distance exactly $t - 2$
from the string $s = v_{\alpha_1} v_{\alpha_2} \ldots v_{\alpha_t}$. Consider the first edge $(v_{\alpha_1}, v_{\alpha_2})$ of the clique;
then it follows that the string $s_{1,2,r} = v_{\alpha_1} v_{\alpha_2} [c_{1,2,r}]^{t-2}$, where edge $r$ has endpoints $v_{\alpha_1}, v_{\alpha_2}$, is
contained in the set of strings $\{s_{1,2,1}, s_{1,2,2}, \ldots, s_{1,2,|E|}\}$. Clearly, $d(s_{1,2,r}, s) = t - 2$. For each edge
in $C$ we have a string in $S$ that has distance $t - 2$ from $s$, and our claim follows from
this construction.

For the reverse direction, we need to prove that the existence of a subset $S^*$ of size $\binom{t}{2}$, where
$d(S^*, c(S^*)) \le \binom{t}{2}(t - 2)$, implies the existence of a clique in $G$ with $t$ vertices. Let $S^*$ be the
subset of $S$ of size $\binom{t}{2}$ such that the total distance from $s$ to the strings in $S^*$ is at most $\binom{t}{2}(t - 2)$. Since $\ell = t$,
$n^* = \binom{t}{2}$, $d = \binom{t}{2}(t - 2)$ and the symbol $c_{i,j,m}$ occurs in only a single string in $S$ for all $i = 1, \ldots, t$,
$j = 1, \ldots, t$ and $m = 1, \ldots, |E|$, it follows from the pigeonhole principle that the consensus string
only contains symbols from the set $\{v_i \mid i = 1, \ldots, |V|\}$. Without loss of generality assume
the consensus string is equal to $v_{\alpha_1} v_{\alpha_2} \cdots v_{\alpha_t}$ for $\alpha_1, \alpha_2, \ldots, \alpha_t \in \{1, \ldots, |V|\}$. Consider any
pair $\alpha_i, \alpha_j$ for $1 \le i < j \le t$ and the set of strings $S_{i,j} = \{s_{i,j,1}, s_{i,j,2}, \ldots, s_{i,j,|E|}\}$. Recall that
$S_{i,j}$ contains a string corresponding to each edge $e = (r, s)$ in $E$ which has $v_r$ at the $i$th position
and $v_s$ at the $j$th position and $c_{i,j,m}$ at all remaining positions. Therefore, we can only find a
string in $S_{i,j}$ that has distance $t - 2$ from $s$ if $v_{\alpha_i}$ is at the $i$th position and $v_{\alpha_j}$ is at the $j$th
position; and such a string exists if and only if there is an edge in $G$ connecting $v_{\alpha_i}$ to $v_{\alpha_j}$.
Hence, the consensus string $s$ implies there exists an edge between any pair of vertices in $G$ in
the set $\{v_{\alpha_1}, v_{\alpha_2}, \ldots, v_{\alpha_t}\}$ and by definition the vertices form a clique. □

Our main theorem follows directly from Lemma 8 and Lemma 9. We note that the hardness
for the combination of all three parameters also implies the hardness for each subset of the
three.